Tidyverse

Data “in the wild” is never ready for visualization. We can’t use all the beautiful plots that we learned in the previous chapter until we have “wrangled” the data into a convenient shape. In this chapter, we’ll introduce a series of functions from the tidyverse collection of packages which help with wrangling, and everything else we need to do to work with data. Such functions include:

  1. filter() a data frame’s existing rows to only pick out a subset of them. For example, picking out only Alaska Airlines flights, as in the alaska_flights data frame.
  2. summarize() one or more of its columns/variables with a summary statistic. Examples of summary statistics include the median and interquartile range of temperatures as we saw in Section @ref(boxplots) on boxplots.
  3. group_by() its rows. In other words, assign different rows to be part of the same group. We can then combine group_by() with summarize() to report summary statistics for each group separately. For example, say you don’t want a single overall average departure delay dep_delay for all three origin airports combined, but rather three separate average departure delays, one computed for each of the three origin airports.
  4. mutate() its existing columns/variables to create new ones. For example, convert hourly temperature recordings from degrees Fahrenheit to degrees Celsius.
  5. arrange() its rows. For example, sort the rows of weather in ascending or descending order of temp.
  6. join() it with another data frame by matching along a “key” variable. In other words, merge these two data frames together.

Notice how we used computer_code font to describe the actions we want to take on our data frames. This is because the dplyr package, one of the packages in the tidyverse, has intuitively verb-named functions that are easy to remember.

There is a further benefit to learning to use the dplyr package for data wrangling: its similarity to the database querying language SQL (pronounced “sequel” or spelled out as “S”, “Q”, “L”). SQL (which stands for “Structured Query Language”) is used to manage large databases quickly and efficiently and is widely used by many institutions with a lot of data. While SQL is a topic left for a book or a course on database management, keep in mind that once you learn dplyr, you can learn SQL easily.

Let’s load all the packages needed for this chapter (this assumes you’ve already installed them).

library(tidyverse)
library(nycflights13)

The pipe operator: %>%

Before we start data wrangling, let’s first introduce a nifty tool that gets loaded with the dplyr package: the pipe operator %>%. The pipe operator allows us to combine multiple operations in R into a single sequential chain of actions.

Let’s start with a hypothetical example. Say you would like to perform a hypothetical sequence of operations on a hypothetical data frame x using hypothetical functions f(), g(), and h():

  1. Take x then
  2. Use x as an input to a function f() then
  3. Use the output of f(x) as an input to a function g() then
  4. Use the output of g(f(x)) as an input to a function h()

One way to achieve this sequence of operations is by using nesting parentheses as follows:

h(g(f(x)))

This code isn’t so hard to read since we are applying only three functions: f(), then g(), then h(), and each function has a short name. Further, each of these functions takes only one argument. However, you can imagine that this will get progressively harder to read as the number of functions in your sequence increases and each function takes more arguments. This is where the pipe operator %>% comes in handy. %>% takes the output of one function and then “pipes” it to be the input of the next function. Furthermore, a helpful trick is to read %>% as “then” or “and then.” For example, you can obtain the same output as the hypothetical sequence of functions as follows:

x %>% 
  f() %>% 
  g() %>% 
  h()

You would read this sequence as:

  1. Take x then
  2. Use this output as the input to the next function f() then
  3. Use this output as the input to the next function g() then
  4. Use this output as the input to the next function h()

So while both approaches achieve the same goal, the latter is much more human-readable because you can clearly read the sequence of operations line-by-line. But what are the hypothetical x, f(), g(), and h()? Throughout this chapter on data wrangling:

  1. The starting value x will be a data frame. For example, the flights data frame we explored in Section @ref(nycflights13).
  2. The sequence of functions, here f(), g(), and h(), will mostly be a sequence of any number of the six data wrangling verb-named functions we listed in the introduction to this chapter. For example, the filter(carrier == "AS") call we previewed earlier.
  3. The result will be the transformed/modified data frame that you want. In our example, we’ll save the result in a new data frame by using the <- assignment operator with the name alaska_flights via alaska_flights <-.

alaska_flights <- flights %>% 
  filter(carrier == "AS")

Much like when adding layers to a ggplot() using the + sign, you form a single chain of data wrangling operations by combining verb-named functions into a single sequence using the pipe operator %>%. Furthermore, much like how the + sign has to come at the end of lines when constructing plots, the pipe operator %>% has to come at the end of lines as well.

It is worth noting that most dplyr verbs, as well as most functions in the larger tidyverse, can be piped in this way because their first argument is always an input tibble. For example, look at ?filter to see that the first argument of filter() is a tibble named .data. So, we can rewrite the above code snippet as:

alaska_flights <- flights %>% 
  filter(.data = ., carrier == "AS")

The “.” serves a special role when using pipes. It represents the tibble that was “passed down” from the previous step in the pipe. Here, we are telling R that “.” — which is flights in this case — is the first argument to filter(). Since argument names do not have to be used, we can rewrite this as:

alaska_flights <- flights %>% 
  filter(., carrier == "AS")

You will almost never write code that looks like this, at least with simple dplyr verbs like filter(). But, behind the scenes, this is what is going on. And, in more advanced cases, we will need to use “.” to refer to the passed-in tibble.

Keep in mind, there are many more advanced data wrangling functions than just the six listed in the introduction to this chapter; you’ll see some examples of these in Section @ref(other-verbs). However, just with these six verb-named functions you’ll be able to perform a broad array of data wrangling tasks for the rest of this book.

filter rows

Diagram of filter() rows operation.

The filter() function here works much like the “Filter” option in Microsoft Excel; it allows you to specify criteria about the values of a variable in your dataset and then keeps only the rows that match those criteria.

We begin by focusing only on flights from New York City to Portland, Oregon. The dest destination code (or airport code) for Portland, Oregon is "PDX". Run the following and look at the results in RStudio’s spreadsheet viewer to ensure that only flights heading to Portland are chosen here:

portland_flights <- flights %>% 
  filter(dest == "PDX")
View(portland_flights)

Note the order of the code. First, take the flights data frame flights, then filter() the data frame so that only those rows where dest equals "PDX" are included. We test for equality using the double equal sign == and not a single equal sign =. In other words, filter(dest = "PDX") will yield an error. This is a convention across many programming languages. If you are new to coding, you’ll probably forget to use the double equal sign == a few times before you get the hang of it.

You can use other operators beyond just the == operator that tests for equality:

  • > corresponds to “greater than”
  • < corresponds to “less than”
  • >= corresponds to “greater than or equal to”
  • <= corresponds to “less than or equal to”
  • != corresponds to “not equal to.” The ! is used in many programming languages to indicate “not.”

Furthermore, you can combine multiple criteria using the logical operators:

  • | corresponds to “or”
  • & corresponds to “and”

To see many of these in action, let’s filter flights for all rows that departed from JFK and were heading to Burlington, Vermont ("BTV") or Seattle, Washington ("SEA") and departed in the months of October, November, or December. Run the following:

btv_sea_flights_fall <- flights %>% 
  filter(origin == "JFK" & (dest == "BTV" | dest == "SEA") & month >= 10)
View(btv_sea_flights_fall)

Note that even though colloquially speaking one might say “all flights leaving Burlington, Vermont and Seattle, Washington,” in terms of computer operations, we really mean “all flights leaving Burlington, Vermont or leaving Seattle, Washington.” For a given row in the data, dest can be "BTV", or "SEA", or something else, but not both "BTV" and "SEA" at the same time. Furthermore, note the careful use of parentheses around dest == "BTV" | dest == "SEA".

We can often skip the use of & and just separate our conditions with a comma. The previous code will return the identical output btv_sea_flights_fall as the following code:

btv_sea_flights_fall <- flights %>% 
  filter(origin == "JFK", (dest == "BTV" | dest == "SEA"), month >= 10)
View(btv_sea_flights_fall)

Let’s present another example that uses the ! “not” operator to pick rows that don’t match a criteria. As mentioned earlier, the ! can be read as “not.” Here we are filtering rows corresponding to flights that didn’t go to Burlington, VT or Seattle, WA.

not_BTV_SEA <- flights %>% 
  filter(!(dest == "BTV" | dest == "SEA"))
View(not_BTV_SEA)

Again, note the careful use of parentheses around the (dest == "BTV" | dest == "SEA"). If we didn’t use parentheses as follows:

flights %>% filter(!dest == "BTV" | dest == "SEA")

We would be returning all flights not headed to "BTV" or those headed to "SEA", which is an entirely different resulting data frame.
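The difference is easiest to see on a small scale. Here is a minimal base R sketch (using a toy vector of destinations, not the flights data) of how operator precedence changes the result:

```r
# Toy destination vector (hypothetical, for illustration only)
dest <- c("BTV", "SEA", "ORD")

# With parentheses: TRUE only when dest is neither "BTV" nor "SEA"
with_parens <- !(dest == "BTV" | dest == "SEA")

# Without parentheses, the ! applies only to the first comparison,
# so this parses as (!(dest == "BTV")) | (dest == "SEA"):
# "not headed to BTV, OR headed to SEA"
without_parens <- !dest == "BTV" | dest == "SEA"

with_parens     # FALSE FALSE TRUE
without_parens  # FALSE TRUE  TRUE
```

Notice that the "SEA" row comes out TRUE without the parentheses, the opposite of what we intended.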

Now say we have a larger number of airports we want to filter for, say "SEA", "SFO", "PDX", "BTV", and "BDL". We could continue to use the | (or) operator:

many_airports <- flights %>% 
  filter(dest == "SEA" | dest == "SFO" | dest == "PDX" | 
         dest == "BTV" | dest == "BDL")

but as we progressively include more airports, this will get unwieldy to write. A slightly shorter approach uses the %in% operator along with the c() function. Recall from Subsection @ref(programming-concepts) that the c() function “combines” or “concatenates” values into a single vector of values.

many_airports <- flights %>% 
  filter(dest %in% c("SEA", "SFO", "PDX", "BTV", "BDL"))
View(many_airports)

What this code is doing is filtering flights for all flights where dest is in the vector of airports c("SEA", "SFO", "PDX", "BTV", "BDL"). Both approaches yield the same many_airports data frame, but as you can see the latter takes much less energy to code. The %in% operator is useful for checking whether each value in one vector/variable matches any value in another.
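To see what %in% returns on its own, here is a small base R sketch with a toy vector of destinations (hypothetical values, for illustration):

```r
# For each value on the left, %in% asks: does it appear anywhere
# in the vector on the right?
dests <- c("SEA", "JFK", "PDX", "ORD")
matches <- dests %in% c("SEA", "SFO", "PDX", "BTV", "BDL")
matches  # TRUE FALSE TRUE FALSE
```

Inside filter(), it is these TRUE/FALSE values that determine which rows are kept.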

As a final note, we recommend that filter() should often be among the first verbs you consider applying to your data, since it narrows down the scope of your data frame to just the observations you care about.

summarize variables

The next common task when working with data frames is to compute summary statistics. Summary statistics are single numerical values that summarize a large number of values. Commonly known examples of summary statistics include the mean (also called the average) and the median (the middle value). Other examples of summary statistics that might not immediately come to mind include the sum, the smallest value also called the minimum, the largest value also called the maximum, and the standard deviation. See Appendix @ref(appendix-stat-terms) for a glossary of such summary statistics.

Let’s calculate two summary statistics of the temp temperature variable in the weather data frame: the mean and standard deviation (recall from Section @ref(nycflights13) that the weather data frame is included in the nycflights13 package). To compute these summary statistics, we need the mean() and sd() summary functions in R. Summary functions in R take in many values and return a single value, as illustrated in Figure @ref(fig:summary-function).

Diagram illustrating a summary function in R.

More precisely, we’ll use the mean() and sd() summary functions within the summarize() function from the dplyr package. Note you can also use the British English spelling of summarise(). As shown in Figure @ref(fig:sum1), the summarize() function takes in a data frame and returns a data frame with only one row corresponding to the summary statistics.

Diagram of summarize() rows.

We’ll save the results in a new data frame called summary_temp that will have two columns/variables: the mean and the std_dev:

summary_temp <- weather %>% 
  summarize(mean = mean(temp), std_dev = sd(temp))
summary_temp
## # A tibble: 1 x 2
##    mean std_dev
##   <dbl>   <dbl>
## 1    NA      NA

Why are the values returned NA? As we saw in Subsection @ref(geompoint) when creating the scatterplot of departure and arrival delays for alaska_flights, NA is how R encodes missing values where NA indicates “not available” or “not applicable.” If a value for a particular row and a particular column does not exist, NA is stored instead. Values can be missing for many reasons. Perhaps the data was collected but someone forgot to enter it? Perhaps the data was not collected at all because it was too difficult to do so? Perhaps there was an erroneous value that someone entered that has been corrected to read as missing? You’ll often encounter issues with missing values when working with real data.

Going back to our summary_temp output, by default any time you try to calculate a summary statistic of a variable that has one or more NA missing values in R, NA is returned. To work around this fact, you can set the na.rm argument to TRUE, where rm is short for “remove”; this will ignore any NA missing values and only return the summary value for all non-missing values.

The code that follows computes the mean and standard deviation of all non-missing values of temp:

summary_temp <- weather %>% 
  summarize(mean = mean(temp, na.rm = TRUE), 
            std_dev = sd(temp, na.rm = TRUE))
summary_temp
## # A tibble: 1 x 2
##    mean std_dev
##   <dbl>   <dbl>
## 1  55.3    17.8

Notice how na.rm = TRUE is used as an argument to the mean() and sd() summary functions individually, and not to the summarize() function.
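You can see this behavior directly in base R with a toy vector of temperatures (made-up values, not the weather data):

```r
temps <- c(68, NA, 72)

mean(temps)                # any NA in the input makes the result NA
mean(temps, na.rm = TRUE)  # drop the NA first: (68 + 72) / 2 = 70
```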

However, one needs to be cautious whenever ignoring missing values as we’ve just done. There are possible ramifications of blindly sweeping rows with missing values “under the rug.” This is in fact why the na.rm argument to any summary statistic function in R is set to FALSE by default. In other words, R does not ignore rows with missing values by default. R is alerting you to the presence of missing data and you should be mindful of this missingness and any potential causes of this missingness throughout your analysis.

What are other summary functions we can use inside the summarize() verb to compute summary statistics? As seen in the diagram in Figure @ref(fig:summary-function), you can use any function in R that takes many values and returns just one. Here are just a few:

  • mean(): the average
  • sd(): the standard deviation, which is a measure of spread
  • min() and max(): the minimum and maximum values, respectively
  • IQR(): interquartile range
  • sum(): the total amount when adding multiple numbers
  • n(): a count of the number of rows in each group. This particular summary function will make more sense when group_by() is covered in Section @ref(groupby).

group_by rows

Diagram of group_by() and summarize().

Say instead of a single mean temperature for the whole year, you would like 12 mean temperatures, one for each of the 12 months separately. In other words, we would like to compute the mean temperature split by month. We can do this by “grouping” temperature observations by the values of another variable, in this case by the 12 values of the variable month. Run the following code:

summary_monthly_temp <- weather %>% 
  group_by(month) %>% 
  summarize(mean = mean(temp, na.rm = TRUE), 
            std_dev = sd(temp, na.rm = TRUE))
summary_monthly_temp
## # A tibble: 12 x 3
##    month  mean std_dev
##    <int> <dbl>   <dbl>
##  1     1  35.6   10.2 
##  2     2  34.3    6.98
##  3     3  39.9    6.25
##  4     4  51.7    8.79
##  5     5  61.8    9.68
##  6     6  72.2    7.55
##  7     7  80.1    7.12
##  8     8  74.5    5.19
##  9     9  67.4    8.47
## 10    10  60.1    8.85
## 11    11  45.0   10.4 
## 12    12  38.4    9.98

This code is identical to the previous code that created summary_temp, but with an extra group_by(month) added before the summarize(). Grouping the weather dataset by month and then applying the summarize() functions yields a data frame that displays the mean and standard deviation temperature split by the 12 months of the year.

It is important to note that the group_by() function doesn’t change data frames by itself. Rather it changes the meta-data, or data about the data, specifically the grouping structure. It is only after we apply the summarize() function that the data frame changes.

For example, let’s consider the diamonds data frame included in the ggplot2 package. Run this code:

diamonds
## # A tibble: 53,940 x 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # … with 53,930 more rows

Observe that the first line of the output reads # A tibble: 53,940 x 10. This is an example of meta-data, in this case the number of observations/rows and variables/columns in diamonds. The actual data itself are the subsequent table of values. Now let’s pipe the diamonds data frame into group_by(cut):

diamonds %>% 
  group_by(cut)
## # A tibble: 53,940 x 10
## # Groups:   cut [5]
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # … with 53,930 more rows

Observe that now there is additional meta-data: # Groups: cut [5] indicating that the grouping structure meta-data has been set based on the 5 possible levels of the categorical variable cut: "Fair", "Good", "Very Good", "Premium", and "Ideal". On the other hand, observe that the data has not changed: it is still a table of 53,940 \(\times\) 10 values.

Only by combining a group_by() with another data wrangling operation, in this case summarize(), will the data actually be transformed.

diamonds %>% 
  group_by(cut) %>% 
  summarize(avg_price = mean(price))
## # A tibble: 5 x 2
##   cut       avg_price
##   <ord>         <dbl>
## 1 Fair          4359.
## 2 Good          3929.
## 3 Very Good     3982.
## 4 Premium       4584.
## 5 Ideal         3458.

If you would like to remove this grouping structure meta-data, pipe the resulting data frame into the ungroup() function:

diamonds %>% 
  group_by(cut) %>% 
  ungroup()
## # A tibble: 53,940 x 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # … with 53,930 more rows

Observe how the # Groups: cut [5] meta-data is no longer present.

Let’s now revisit the n() counting summary function we briefly introduced previously. Recall that the n() function counts rows. This is opposed to the sum() summary function that returns the sum of a numerical variable. For example, suppose we’d like to count how many flights departed each of the three airports in New York City:

by_origin <- flights %>% 
  group_by(origin) %>% 
  summarize(count = n())
by_origin
## # A tibble: 3 x 2
##   origin  count
##   <chr>   <int>
## 1 EWR    120835
## 2 JFK    111279
## 3 LGA    104662

We see that Newark ("EWR") had the most flights departing in 2013 followed by "JFK" and lastly by LaGuardia ("LGA"). Note there is a subtle but important difference between sum() and n(); while sum() returns the sum of a numerical variable, n() returns a count of the number of rows/observations.
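The distinction is clearer on a toy data frame (made-up values, not the real flights data), assuming the dplyr package is loaded:

```r
library(dplyr)

# Three rows: two EWR departures and one JFK departure
toy_flights <- tibble(
  origin = c("EWR", "EWR", "JFK"),
  seats  = c(100, 150, 200)
)

toy_counts <- toy_flights %>% 
  group_by(origin) %>% 
  summarize(count = n(),               # number of rows in each group
            total_seats = sum(seats))  # sum of a numerical variable
toy_counts  # EWR: count 2, total_seats 250; JFK: count 1, total_seats 200
```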

Grouping by more than one variable

You are not limited to grouping by one variable. Say you want to know the number of flights leaving each of the three New York City airports for each month. We can also group by a second variable month using group_by(origin, month):

by_origin_monthly <- flights %>% 
  group_by(origin, month) %>% 
  summarize(count = n())
by_origin_monthly
## # A tibble: 36 x 3
## # Groups:   origin [3]
##    origin month count
##    <chr>  <int> <int>
##  1 EWR        1  9893
##  2 EWR        2  9107
##  3 EWR        3 10420
##  4 EWR        4 10531
##  5 EWR        5 10592
##  6 EWR        6 10175
##  7 EWR        7 10475
##  8 EWR        8 10359
##  9 EWR        9  9550
## 10 EWR       10 10104
## # … with 26 more rows

Observe that there are 36 rows to by_origin_monthly because there are 12 months for 3 airports (EWR, JFK, and LGA).

Why do we group_by(origin, month) and not group_by(origin) and then group_by(month)? Let’s investigate:

by_origin_monthly_incorrect <- flights %>% 
  group_by(origin) %>% 
  group_by(month) %>% 
  summarize(count = n())
by_origin_monthly_incorrect
## # A tibble: 12 x 2
##    month count
##    <int> <int>
##  1     1 27004
##  2     2 24951
##  3     3 28834
##  4     4 28330
##  5     5 28796
##  6     6 28243
##  7     7 29425
##  8     8 29327
##  9     9 27574
## 10    10 28889
## 11    11 27268
## 12    12 28135

What happened here is that the second group_by(month) overwrote the grouping structure meta-data of the earlier group_by(origin), so that in the end we are only grouping by month. The lesson here is if you want to group_by() two or more variables, you should include all the variables at the same time in the same group_by() adding a comma between the variable names.
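If you really do want to build up the grouping in two steps, group_by() has an .add argument (FALSE by default, and named add in dplyr versions before 1.0) that preserves the earlier grouping instead of overwriting it. A small sketch with a toy data frame:

```r
library(dplyr)

toy <- tibble(origin = c("EWR", "EWR", "JFK", "JFK"),
              month  = c(1, 2, 1, 1))

grouped <- toy %>% 
  group_by(origin) %>% 
  group_by(month, .add = TRUE)  # .add = TRUE keeps origin as well

group_vars(grouped)  # both grouping variables: "origin" "month"
```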

mutate existing variables

Diagram of mutate() columns.

Another common transformation of data is to create/compute new variables based on existing ones. For example, say you are more comfortable thinking of temperature in degrees Celsius (°C) instead of degrees Fahrenheit (°F). The formula to convert temperatures from °F to °C is

\[ \text{temp in C} = \frac{\text{temp in F} - 32}{1.8} \]

We can apply this formula to the temp variable using the mutate() function from the dplyr package, which takes existing variables and mutates them to create new ones.

weather <- weather %>% 
  mutate(temp_in_C = (temp - 32) / 1.8)

In this code, we mutate() the weather data frame by creating a new variable temp_in_C = (temp - 32) / 1.8 and then overwrite the original weather data frame. Why did we overwrite the data frame weather, instead of assigning the result to a new data frame like weather_new? As a rough rule of thumb, as long as you are not losing original information that you might need later, it’s acceptable practice to overwrite existing data frames with updated ones, as we did here. On the other hand, why did we not overwrite the variable temp, but instead created a new variable called temp_in_C? Because if we did this, we would have erased the original information contained in temp of temperatures in Fahrenheit that may still be valuable to us.

Let’s now compute monthly average temperatures in both °F and °C using the group_by() and summarize() code we saw in Section @ref(groupby):

summary_monthly_temp <- weather %>% 
  group_by(month) %>% 
  summarize(mean_temp_in_F = mean(temp, na.rm = TRUE), 
            mean_temp_in_C = mean(temp_in_C, na.rm = TRUE))
summary_monthly_temp
## # A tibble: 12 x 3
##    month mean_temp_in_F mean_temp_in_C
##    <int>          <dbl>          <dbl>
##  1     1           35.6           2.02
##  2     2           34.3           1.26
##  3     3           39.9           4.38
##  4     4           51.7          11.0 
##  5     5           61.8          16.6 
##  6     6           72.2          22.3 
##  7     7           80.1          26.7 
##  8     8           74.5          23.6 
##  9     9           67.4          19.7 
## 10    10           60.1          15.6 
## 11    11           45.0           7.22
## 12    12           38.4           3.58

Let’s consider another example. Passengers are often frustrated when their flight departs late, but aren’t as annoyed if, in the end, pilots can make up some time during the flight. This is known in the airline industry as gain, and we will create this variable using the mutate() function:

flights <- flights %>% 
  mutate(gain = dep_delay - arr_delay)

Let’s take a look at only the dep_delay, arr_delay, and the resulting gain variables for the first 5 rows in our updated flights data frame in Table @ref(tab:first-five-flights).

First five rows of departure/arrival delay and gain variables

  dep_delay  arr_delay  gain
          2         11    -9
          4         20   -16
          2         33   -31
         -1        -18    17
         -6        -25    19

The flight in the first row departed 2 minutes late but arrived 11 minutes late, so its “gained time in the air” is a loss of 9 minutes, hence its gain is 2 - 11 = -9. On the other hand, the flight in the fourth row departed a minute early (dep_delay of -1) but arrived 18 minutes early (arr_delay of -18), so its “gained time in the air” is \(-1 - (-18) = -1 + 18 = 17\) minutes, hence its gain is +17.
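We can double-check this arithmetic by rebuilding just these two rows by hand (a toy data frame, assuming dplyr is loaded):

```r
library(dplyr)

toy_delays <- tibble(dep_delay = c(2, -1),
                     arr_delay = c(11, -18))

toy_gain <- toy_delays %>% 
  mutate(gain = dep_delay - arr_delay)
toy_gain  # gain is -9 and 17, matching the table above
```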

Let’s look at some summary statistics of the gain variable by considering multiple summary functions at once in the same summarize() code:

gain_summary <- flights %>% 
  summarize(
    min = min(gain, na.rm = TRUE),
    q1 = quantile(gain, 0.25, na.rm = TRUE),
    median = quantile(gain, 0.5, na.rm = TRUE),
    q3 = quantile(gain, 0.75, na.rm = TRUE),
    max = max(gain, na.rm = TRUE),
    mean = mean(gain, na.rm = TRUE),
    sd = sd(gain, na.rm = TRUE),
    missing = sum(is.na(gain))
  )
gain_summary
## # A tibble: 1 x 8
##     min    q1 median    q3   max  mean    sd missing
##   <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>   <int>
## 1  -196    -3      7    17   109  5.66  18.0    9430

We see for example that the average gain is +5 minutes, while the largest is +109 minutes! However, this code would take some time to type out in practice. We’ll see later on in Subsection @ref(model1EDA) that there is a much more succinct way to compute a variety of common summary statistics: using the skim() function from the skimr package.

Recall from Section @ref(histograms) that since gain is a numerical variable, we can visualize its distribution using a histogram.

ggplot(data = flights, mapping = aes(x = gain)) +
  geom_histogram(color = "white", bins = 20)
## Warning: Removed 9430 rows containing non-finite values (stat_bin).

Histogram of gain variable.

The resulting histogram in Figure @ref(fig:gain-hist) provides a different perspective on the gain variable than the summary statistics we computed earlier. For example, note that most values of gain are right around 0.

To close out our discussion on the mutate() function to create new variables, note that we can create multiple new variables at once in the same mutate() code. Furthermore, within the same mutate() code we can refer to new variables we just created. As an example, consider the mutate() code Hadley Wickham and Garrett Grolemund show in Chapter 5 of R for Data Science [@rds2016]:

flights <- flights %>% 
  mutate(
    gain = dep_delay - arr_delay,
    hours = air_time / 60,
    gain_per_hour = gain / hours
  )

arrange and sort rows

One of the most commonly performed data wrangling tasks is to sort a data frame’s rows in the alphanumeric order of one of the variables. The dplyr package’s arrange() function allows us to sort/reorder a data frame’s rows according to the values of the specified variable.

Suppose we are interested in determining the most frequent destination airports for all domestic flights departing from New York City in 2013:

freq_dest <- flights %>% 
  group_by(dest) %>% 
  summarize(num_flights = n())
freq_dest
## # A tibble: 105 x 2
##    dest  num_flights
##    <chr>       <int>
##  1 ABQ           254
##  2 ACK           265
##  3 ALB           439
##  4 ANC             8
##  5 ATL         17215
##  6 AUS          2439
##  7 AVL           275
##  8 BDL           443
##  9 BGR           375
## 10 BHM           297
## # … with 95 more rows

Observe that by default the rows of the resulting freq_dest data frame are sorted in alphabetical order of dest. Say we would like to see the same data sorted from the most to the least number of flights (num_flights) instead:

freq_dest %>% 
  arrange(num_flights)
## # A tibble: 105 x 2
##    dest  num_flights
##    <chr>       <int>
##  1 LEX             1
##  2 LGA             1
##  3 ANC             8
##  4 SBN            10
##  5 HDN            15
##  6 MTJ            15
##  7 EYW            17
##  8 PSP            19
##  9 JAC            25
## 10 BZN            36
## # … with 95 more rows

This is, however, the opposite of what we want. The rows are sorted with the least frequent destination airports displayed first. This is because arrange() always returns rows sorted in ascending order by default. To switch the ordering to be in “descending” order instead, we use the desc() function as so:

freq_dest %>% 
  arrange(desc(num_flights))
## # A tibble: 105 x 2
##    dest  num_flights
##    <chr>       <int>
##  1 ORD         17283
##  2 ATL         17215
##  3 LAX         16174
##  4 BOS         15508
##  5 MCO         14082
##  6 CLT         14064
##  7 SFO         13331
##  8 FLL         12055
##  9 MIA         11728
## 10 DCA          9705
## # … with 95 more rows

Factors

We’ve spent a lot of time working with big, beautiful data frames, like the Gapminder data. But we also need to manage the individual variables housed within.

Factors are the variable type that useRs love to hate. It is how we store truly categorical information in R. The values a factor can take on are called the levels. For example, the levels of the factor continent in Gapminder are “Africa”, “Americas”, etc. and this is what’s usually presented to your eyeballs by R. In general, the levels are friendly human-readable character strings, like “male/female” and “control/treated”. But never ever ever forget that, under the hood, R is really storing integer codes 1, 2, 3, etc.
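
A quick base-R sketch of what’s under the hood (the treatment-group labels here are made up for illustration):

```r
# A factor is really integer codes plus a "levels" attribute
x <- factor(c("treated", "control", "control", "treated"))
levels(x)       # what you see: friendly labels, sorted alphabetically
## [1] "control" "treated"
as.integer(x)   # what R stores: integer codes indexing into the levels
## [1] 2 1 1 2
```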

This [Janus][wiki-janus]-like nature of factors means they are rich with booby traps for the unsuspecting but they are a necessary evil. I recommend you learn how to be the boss of your factors. The pros far outweigh the cons. Specifically in modelling and figure-making, factors are anticipated and accommodated by the functions and packages you will want to exploit.

The worst kind of factor is the stealth factor: the variable that you think of as character, but that is actually a factor (integer under the hood!!). This is a classic R gotcha. Check your variable types explicitly when things seem weird. It happens to the best of us.
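
Here is the classic gotcha in miniature, using a made-up factor of number-like strings:

```r
# A stealth factor of number-like strings
f <- factor(c("10", "20", "30"))
as.numeric(f)                # the underlying integer codes -- almost never what you want
## [1] 1 2 3
as.numeric(as.character(f))  # go via character to recover the actual values
## [1] 10 20 30
```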

Where do stealth factors come from? Base R has a burning desire to turn character information into factor. This happens most commonly at data import via read.table() and friends. But data.frame() and other functions are also eager to convert character to factor. To shut this down, use stringsAsFactors = FALSE in read.table() and data.frame() or – even better – use the tidyverse! For data import, use readr::read_csv(), readr::read_tsv(), etc. For data frame creation, use tibble::tibble(). And so on.
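
You can see the conversion in a base-R sketch. (Note: since R 4.0.0, stringsAsFactors = FALSE is the default, so the old character-to-factor behavior must now be requested explicitly, as we do here.)

```r
# The old default: characters silently become factors
df_factor <- data.frame(x = c("a", "b"), stringsAsFactors = TRUE)
class(df_factor$x)
## [1] "factor"
# The safe behavior (default since R 4.0.0, and always the tibble behavior)
df_char <- data.frame(x = c("a", "b"), stringsAsFactors = FALSE)
class(df_char$x)
## [1] "character"
```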

Good articles about how the factor fiasco came to be:

  • [stringsAsFactors: An unauthorized biography][bio-strings-as-factors] by Roger Peng
  • [stringsAsFactors = <sigh>][blog-strings-as-factors] by Thomas Lumley

The forcats package

[forcats][forcats-web] is a core package in the tidyverse. It is installed via install.packages("tidyverse"), and loaded with library(tidyverse). You can also install it via install.packages("forcats") and load it yourself separately as needed via library(forcats). Main functions start with fct_. There really is no coherent family of base functions that forcats replaces – that’s why it’s such a welcome addition.

Load the tidyverse (which includes forcats) and gapminder.

library(tidyverse)
library(gapminder)

Get to know your factor before you start touching it! It’s polite. Let’s use gapminder$continent as our example.

str(gapminder$continent)
##  Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
levels(gapminder$continent)
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"
nlevels(gapminder$continent)
## [1] 5
class(gapminder$continent)
## [1] "factor"

To get a frequency table as a tibble, from a tibble, use dplyr::count(). To get similar from a free-range factor, use forcats::fct_count().

gapminder %>% 
  count(continent)
## # A tibble: 5 x 2
##   continent     n
##   <fct>     <int>
## 1 Africa      624
## 2 Americas    300
## 3 Asia        396
## 4 Europe      360
## 5 Oceania      24
fct_count(gapminder$continent)
## # A tibble: 5 x 2
##   f            n
##   <fct>    <int>
## 1 Africa     624
## 2 Americas   300
## 3 Asia       396
## 4 Europe     360
## 5 Oceania     24

Dropping unused levels

Even if you drop all the rows corresponding to a specific factor level, the levels of the factor itself do not change. Sometimes all these unused levels can come back to haunt you later, e.g., in figure legends.

Watch what happens to the levels of country (= nothing) when we filter Gapminder to a handful of countries.

nlevels(gapminder$country)
## [1] 142
h_countries <- c("Egypt", "Haiti", "Romania", "Thailand", "Venezuela")
h_gap <- gapminder %>%
  filter(country %in% h_countries)
nlevels(h_gap$country)
## [1] 142

Even though h_gap only has data for a handful of countries, we are still schlepping around all 142 levels from the original gapminder tibble.

How can you get rid of them? The base function droplevels() operates on all the factors in a data frame or on a single factor. The function forcats::fct_drop() operates on a factor.

h_gap_dropped <- h_gap %>% 
  droplevels()
nlevels(h_gap_dropped$country)
## [1] 5
# Use forcats::fct_drop() on a free-range factor
h_gap$country %>%
  fct_drop() %>%
  levels()
## [1] "Egypt"     "Haiti"     "Romania"   "Thailand"  "Venezuela"

Change order of the levels, principled

By default, factor levels are ordered alphabetically. Which might as well be random, when you think about it! It is preferable to order the levels according to some principle:

  • Frequency. Make the most common level the first and so on.
  • Another variable. Order factor levels according to a summary statistic for another variable. Example: order Gapminder countries by life expectancy.

First, let’s order continent by frequency, forwards and backwards. This is often a great idea for tables and figures, esp. frequency barplots.

# Default order is alphabetical
gapminder$continent %>%
  levels()
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"
# Order by frequency
gapminder$continent %>% 
  fct_infreq() %>%
  levels()
## [1] "Africa"   "Asia"     "Europe"   "Americas" "Oceania"
# Backwards!
gapminder$continent %>% 
  fct_infreq() %>%
  fct_rev() %>% 
  levels()
## [1] "Oceania"  "Americas" "Europe"   "Asia"     "Africa"

These two barcharts of frequency by continent differ only in the order of the continents. Which do you prefer?

Now we order country by another variable, forwards and backwards. This other variable is usually quantitative and you will order the factor according to a grouped summary. The factor is the grouping variable and the default summarizing function is median() but you can specify something else.

# Order countries by median life expectancy
fct_reorder(gapminder$country, gapminder$lifeExp) %>% 
  levels() %>% head()
## [1] "Sierra Leone"  "Guinea-Bissau" "Afghanistan"   "Angola"       
## [5] "Somalia"       "Guinea"
# Order according to minimum life exp instead of median
fct_reorder(gapminder$country, gapminder$lifeExp, min) %>% 
  levels() %>% head()
## [1] "Rwanda"       "Afghanistan"  "Gambia"       "Angola"       "Sierra Leone"
## [6] "Cambodia"
# Backwards!
fct_reorder(gapminder$country, gapminder$lifeExp, .desc = TRUE) %>% 
  levels() %>% head()
## [1] "Iceland"     "Japan"       "Sweden"      "Switzerland" "Netherlands"
## [6] "Norway"

Why do we reorder factor levels? Because it often makes plots much better! When a factor is mapped to x or y, it should almost always be reordered by the quantitative variable you are mapping to the other one.

Compare the interpretability of these two plots of life expectancy in Asian countries in 2007. The only difference is the order of the country factor. Which one do you find easier to learn from?

gap_asia_2007 <- gapminder %>% filter(year == 2007, continent == "Asia")
ggplot(gap_asia_2007, aes(x = lifeExp, y = country)) + geom_point()
ggplot(gap_asia_2007, aes(x = lifeExp, y = fct_reorder(country, lifeExp))) +
  geom_point()

Use fct_reorder2() when you have a line chart of a quantitative x against another quantitative y and your factor provides the color. This way the legend appears in the same order as the data! Contrast the legend on the left with the one on the right.

h_countries <- c("Egypt", "Haiti", "Romania", "Thailand", "Venezuela")
h_gap <- gapminder %>%
  filter(country %in% h_countries) %>% 
  droplevels()
ggplot(h_gap, aes(x = year, y = lifeExp, color = country)) +
  geom_line()
ggplot(h_gap, aes(x = year, y = lifeExp,
                  color = fct_reorder2(country, year, lifeExp))) +
  geom_line() +
  labs(color = "country")

Change order of the levels, “because I said so”

Sometimes you just want to hoist one or more levels to the front. Why? Because I said so. This resembles what we do when we move variables to the front with dplyr::select(special_var, everything()).

h_gap$country %>% levels()
## [1] "Egypt"     "Haiti"     "Romania"   "Thailand"  "Venezuela"
h_gap$country %>% fct_relevel("Romania", "Haiti") %>% levels()
## [1] "Romania"   "Haiti"     "Egypt"     "Thailand"  "Venezuela"

This might be useful if you are preparing a report for, say, the Romanian government. The reason for always putting Romania first has nothing to do with the data, it is important for external reasons and you need a way to express this.

Recode the levels

Sometimes you have better ideas about what certain levels should be. This is called recoding.

i_gap <- gapminder %>% 
  filter(country %in% c("United States", "Sweden", "Australia")) %>% 
  droplevels()
i_gap$country %>% levels()
## [1] "Australia"     "Sweden"        "United States"
i_gap$country %>%
  fct_recode("USA" = "United States", "Oz" = "Australia") %>% levels()
## [1] "Oz"     "Sweden" "USA"

Grow a factor

Let’s create two data frames, each with data from two countries, dropping unused factor levels.

df1 <- gapminder %>%
  filter(country %in% c("United States", "Mexico"), year > 2000) %>%
  droplevels()
df2 <- gapminder %>%
  filter(country %in% c("France", "Germany"), year > 2000) %>%
  droplevels()

The country factors in df1 and df2 have different levels.

levels(df1$country)
## [1] "Mexico"        "United States"
levels(df2$country)
## [1] "France"  "Germany"

Can you just combine them?

c(df1$country, df2$country)
## [1] 1 1 2 2 1 1 2 2

Umm, no. That is wrong on many levels! (That was the behavior of c() before R 4.1.0; newer versions of R combine factors properly.) Use fct_c() to do this reliably in any version.

fct_c(df1$country, df2$country)
## [1] Mexico        Mexico        United States United States France       
## [6] France        Germany       Germany      
## Levels: Mexico United States France Germany

Character Vectors

We’ve spent a lot of time working with big, beautiful data frames that are clean and wholesome, like the Gapminder data.

But real life will be much nastier. You will bring data into R from the outside world and discover there are problems. You might think: how hard can it be to deal with character data? And the answer is: it can be very hard!

  • [Stack Exchange outage][stackexchange-outage]
  • [Regexes to validate/match email addresses][email-regex]
  • [Fixing an Atom bug][fix-atom-bug]

Here we discuss common remedial tasks for cleaning and transforming character data, also known as “strings”. A data frame or tibble will consist of one or more atomic vectors of a certain class. This lesson deals with things you can do with vectors of class character.

Here are some resources:

Manipulating character vectors

  • [stringr package][stringr-web].
    • A core package in the tidyverse. It is installed via install.packages("tidyverse") and also loaded via library(tidyverse). Of course, you can also install or load it individually.
    • Main functions start with str_. Auto-complete is your friend.
    • Replacements for base functions re: string manipulation and regular expressions (see below).
    • Main advantages over base functions: greater consistency about inputs and outputs. Outputs are more ready for your next analytical task.
  • [tidyr package][tidyr-web].
    • Especially useful for functions that split one character vector into many and vice versa: separate(), unite(), extract().
  • The [glue package][glue-web] is fantastic for string interpolation. If stringr::str_interp() doesn’t get your job done, check out the glue package.

Regular expressions resources

A God-awful and powerful language for expressing patterns to match in text or for search-and-replace. Frequently described as “write only”, because regular expressions are easier to write than to read/understand. And they are not particularly easy to write.

  • We again prefer the [stringr package][stringr-cran] over base functions. Why?
    • Wraps [stringi][stringi-cran], which is a great place to look if stringr isn’t powerful enough.
    • Standardized on [ICU regular expressions][icu-regex], so you can stop toggling perl = TRUE/FALSE at random.
    • Results come back in a form that is much friendlier for downstream work.
  • The [Strings chapter][r4ds-strings] of [R for Data Science][r4ds] [@wickham2016] is a great resource.
  • RStudio Cheat Sheet on [Regular Expressions in R][rstudio-regex-cheatsheet].
  • Regex testers:
    • [regex101.com][regex101]
    • [regexr.com][regexr]

Character encoding resources

  • [Strings subsection of data import chapter][r4ds-readr-strings] in [R for Data Science][r4ds] [@wickham2016].
  • Screeds on the Minimum Everyone Needs to Know about encoding:
    • [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)][unicode-no-excuses]
    • [What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text][programmers-encoding]
  • Chapter @ref(character-encoding) - I’ve translated this blog post [Guide to fixing encoding problems in Ruby][encoding-probs-ruby] into R as the first step to developing a lesson.

Character vectors that live in a data frame

  • Certain operations are facilitated by tidyr. These are described below.
  • For a general discussion of how to work on variables that live in a data frame, see Vectors versus tibbles (Appendix @ref(oldies)).

Load the tidyverse, which includes stringr

library(tidyverse)

Regex-free string manipulation with stringr and tidyr

Basic string manipulation tasks:

  • Study a single character vector
    • How long are the strings?
    • Presence/absence of a literal string
  • Operate on a single character vector
    • Keep/discard elements that contain a literal string
    • Split into two or more character vectors using a fixed delimiter
    • Snip out pieces of the strings based on character position
    • Collapse into a single string
  • Operate on two or more character vectors
    • Glue them together element-wise to get a new character vector.

fruit, words, and sentences are character vectors that ship with stringr for practicing.

Detect or filter on a target string

Determine presence/absence of a literal string with str_detect(). Spoiler: later we see str_detect() also detects regular expressions.

Which fruits actually use the word “fruit”?

str_detect(fruit, pattern = "fruit")
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
## [37] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE

What’s the easiest way to get the actual fruits that match? Use str_subset() to keep only the matching elements. Note we are storing this new vector my_fruit to use in later examples!

(my_fruit <- str_subset(fruit, pattern = "fruit"))
## [1] "breadfruit"   "dragonfruit"  "grapefruit"   "jackfruit"    "kiwi fruit"  
## [6] "passionfruit" "star fruit"   "ugli fruit"

String splitting by delimiter

Use stringr::str_split() to split strings on a delimiter. Some of our fruits are compound words, like “grapefruit”, but some have two words, like “ugli fruit”. Here we split on a single space " "; we’ll show use of a regular expression later.

str_split(my_fruit, pattern = " ")
## [[1]]
## [1] "breadfruit"
## 
## [[2]]
## [1] "dragonfruit"
## 
## [[3]]
## [1] "grapefruit"
## 
## [[4]]
## [1] "jackfruit"
## 
## [[5]]
## [1] "kiwi"  "fruit"
## 
## [[6]]
## [1] "passionfruit"
## 
## [[7]]
## [1] "star"  "fruit"
## 
## [[8]]
## [1] "ugli"  "fruit"

It’s a bummer that we get a list back. But it must be so! In full generality, splitting strings must return a list, because who knows how many pieces there will be?

If you are willing to commit to the number of pieces, you can use str_split_fixed() and get a character matrix. You’re welcome!

str_split_fixed(my_fruit, pattern = " ", n = 2)
##      [,1]           [,2]   
## [1,] "breadfruit"   ""     
## [2,] "dragonfruit"  ""     
## [3,] "grapefruit"   ""     
## [4,] "jackfruit"    ""     
## [5,] "kiwi"         "fruit"
## [6,] "passionfruit" ""     
## [7,] "star"         "fruit"
## [8,] "ugli"         "fruit"

If the to-be-split variable lives in a data frame, tidyr::separate() will split it into 2 or more variables.

my_fruit_df <- tibble(my_fruit)
my_fruit_df %>% 
  separate(my_fruit, into = c("pre", "post"), sep = " ")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 5 rows [1, 2, 3,
## 4, 6].
## # A tibble: 8 x 2
##   pre          post 
##   <chr>        <chr>
## 1 breadfruit   <NA> 
## 2 dragonfruit  <NA> 
## 3 grapefruit   <NA> 
## 4 jackfruit    <NA> 
## 5 kiwi         fruit
## 6 passionfruit <NA> 
## 7 star         fruit
## 8 ugli         fruit

Substring extraction (and replacement) by position

Count characters in your strings with str_length(). Note this is different from the length of the character vector itself.

length(my_fruit)
## [1] 8
str_length(my_fruit)
## [1] 10 11 10  9 10 12 10 10

You can snip out substrings based on character position with str_sub().

head(fruit) %>% 
  str_sub(1, 3)
## [1] "app" "apr" "avo" "ban" "bel" "bil"

The start and end arguments are vectorised. Example: a sliding 3-character window.

tibble(fruit) %>% 
  head() %>% 
  mutate(snip = str_sub(fruit, 1:6, 3:8))
## # A tibble: 6 x 2
##   fruit       snip 
##   <chr>       <chr>
## 1 apple       "app"
## 2 apricot     "pri"
## 3 avocado     "oca"
## 4 banana      "ana"
## 5 bell pepper " pe"
## 6 bilberry    "rry"

Finally, str_sub() also works for assignment, i.e. on the left hand side of <-.

(x <- head(fruit, 3))
## [1] "apple"   "apricot" "avocado"
str_sub(x, 1, 3) <- "AAA"
x
## [1] "AAAle"   "AAAicot" "AAAcado"

Collapse a vector

You can collapse a character vector of length n > 1 to a single string with str_c(), which also has other uses (see the next section).

head(fruit) %>% 
  str_c(collapse = ", ")
## [1] "apple, apricot, avocado, banana, bell pepper, bilberry"

Create a character vector by catenating multiple vectors

If you have two or more character vectors of the same length, you can glue them together element-wise, to get a new vector of that length. Here are some … awful smoothie flavors?

str_c(fruit[1:4], fruit[5:8], sep = " & ")
## [1] "apple & bell pepper"   "apricot & bilberry"    "avocado & blackberry" 
## [4] "banana & blackcurrant"

Element-wise catenation can be combined with collapsing.

str_c(fruit[1:4], fruit[5:8], sep = " & ", collapse = ", ")
## [1] "apple & bell pepper, apricot & bilberry, avocado & blackberry, banana & blackcurrant"

If the to-be-combined vectors are variables in a data frame, you can use tidyr::unite() to make a single new variable from them.

fruit_df <- tibble(
  fruit1 = fruit[1:4],
  fruit2 = fruit[5:8]
)
fruit_df %>% 
  unite("flavor_combo", fruit1, fruit2, sep = " & ")
## # A tibble: 4 x 1
##   flavor_combo         
##   <chr>                
## 1 apple & bell pepper  
## 2 apricot & bilberry   
## 3 avocado & blackberry 
## 4 banana & blackcurrant

Substring replacement

You can replace a pattern with str_replace(). Here we use an explicit string-to-replace, but later we revisit with a regular expression.

str_replace(my_fruit, pattern = "fruit", replacement = "THINGY")
## [1] "breadTHINGY"   "dragonTHINGY"  "grapeTHINGY"   "jackTHINGY"   
## [5] "kiwi THINGY"   "passionTHINGY" "star THINGY"   "ugli THINGY"

A special case that comes up a lot is replacing NA, for which there is str_replace_na().

melons <- str_subset(fruit, pattern = "melon")
melons[2] <- NA
melons
## [1] "canary melon" NA             "watermelon"
str_replace_na(melons, "UNKNOWN MELON")
## [1] "canary melon"  "UNKNOWN MELON" "watermelon"

If the NA-afflicted variable lives in a data frame, you can use tidyr::replace_na().

tibble(melons) %>% 
  replace_na(replace = list(melons = "UNKNOWN MELON"))
## # A tibble: 3 x 1
##   melons       
##   <chr>        
## 1 canary melon 
## 2 UNKNOWN MELON
## 3 watermelon

And that concludes our treatment of regex-free manipulations of character data!

Regular expressions with stringr

The country names in the gapminder dataset are convenient for examples. Load it now and store the 142 unique country names in the object countries.

library(gapminder)
countries <- levels(gapminder$country)

Characters with special meaning

Frequently your string tasks cannot be expressed in terms of a fixed string, but can be described in terms of a pattern. Regular expressions, aka “regexes”, are the standard way to specify these patterns. In regexes, specific characters and constructs take on special meaning in order to match multiple strings.

The first metacharacter is the period ., which stands for any single character, except a newline (which, by the way, is represented by \n). The regex i.a will match all countries that have an i, followed by any single character, followed by a. Yes, regexes are case sensitive, i.e. “Italy” does not match.

str_subset(countries, pattern = "i.a")
##  [1] "Argentina"                "Bosnia and Herzegovina"  
##  [3] "Burkina Faso"             "Central African Republic"
##  [5] "China"                    "Costa Rica"              
##  [7] "Dominican Republic"       "Hong Kong, China"        
##  [9] "Jamaica"                  "Mauritania"              
## [11] "Nicaragua"                "South Africa"            
## [13] "Swaziland"                "Taiwan"                  
## [15] "Thailand"                 "Trinidad and Tobago"

Notice that i.a matches “ina”, “ica”, “ita”, and more.

Anchors can be included to express where the expression must occur within the string. The ^ indicates the beginning of string and $ indicates the end.

Note how the regex i.a$ matches many fewer countries than i.a alone. Likewise, more elements of my_fruit match d than ^d, which requires “d” at string start.

str_subset(countries, pattern = "i.a$")
## [1] "Argentina"              "Bosnia and Herzegovina" "China"                 
## [4] "Costa Rica"             "Hong Kong, China"       "Jamaica"               
## [7] "South Africa"
str_subset(my_fruit, pattern = "d")
## [1] "breadfruit"  "dragonfruit"
str_subset(my_fruit, pattern = "^d")
## [1] "dragonfruit"

The metacharacter \b indicates a word boundary and \B indicates NOT a word boundary. This is our first encounter with something called “escaping” and right now I just want you to accept that we need to prepend a second backslash to use these sequences in regexes in R. We’ll come back to this tedious point later.

str_subset(fruit, pattern = "melon")
## [1] "canary melon" "rock melon"   "watermelon"
str_subset(fruit, pattern = "\\bmelon")
## [1] "canary melon" "rock melon"
str_subset(fruit, pattern = "\\Bmelon")
## [1] "watermelon"

Character classes

Characters can be specified via classes. You can make them explicitly “by hand” or use some pre-existing ones. Character classes are usually given inside square brackets, [] but a few come up so often that we have a metacharacter for them, such as \d for a single digit.
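
For example, \d in action, sketched here with base grepl() (stringr’s str_detect() takes the same pattern); the vector x is made up for illustration:

```r
# "\\d" matches any single digit; note the doubled backslash in the R string
x <- c("route 66", "blackbird")
grepl("\\d", x)
## [1]  TRUE FALSE
```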

Here we match ia at the end of the country name, preceded by one of the characters in the class. Or, in the negated class, preceded by anything but one of those characters.

# Make a class "by hand"
str_subset(countries, pattern = "[nls]ia$")
##  [1] "Albania"    "Australia"  "Indonesia"  "Malaysia"   "Mauritania"
##  [6] "Mongolia"   "Romania"    "Slovenia"   "Somalia"    "Tanzania"  
## [11] "Tunisia"
# Use ^ to negate the class
str_subset(countries, pattern = "[^nls]ia$")
##  [1] "Algeria"      "Austria"      "Bolivia"      "Bulgaria"     "Cambodia"    
##  [6] "Colombia"     "Croatia"      "Ethiopia"     "Gambia"       "India"       
## [11] "Liberia"      "Namibia"      "Nigeria"      "Saudi Arabia" "Serbia"      
## [16] "Syria"        "Zambia"

Here we revisit splitting my_fruit with two more general ways to match whitespace: the \s metacharacter and the POSIX class [:space:]. Notice that we must prepend an extra backslash \ to escape \s and the POSIX class has to be surrounded by two sets of square brackets.

# Remember this?
# str_split_fixed(my_fruit, pattern = " ", n = 2)
# Alternatives:
str_split_fixed(my_fruit, pattern = "\\s", n = 2)
##      [,1]           [,2]   
## [1,] "breadfruit"   ""     
## [2,] "dragonfruit"  ""     
## [3,] "grapefruit"   ""     
## [4,] "jackfruit"    ""     
## [5,] "kiwi"         "fruit"
## [6,] "passionfruit" ""     
## [7,] "star"         "fruit"
## [8,] "ugli"         "fruit"
str_split_fixed(my_fruit, pattern = "[[:space:]]", n = 2)
##      [,1]           [,2]   
## [1,] "breadfruit"   ""     
## [2,] "dragonfruit"  ""     
## [3,] "grapefruit"   ""     
## [4,] "jackfruit"    ""     
## [5,] "kiwi"         "fruit"
## [6,] "passionfruit" ""     
## [7,] "star"         "fruit"
## [8,] "ugli"         "fruit"

Let’s see the country names that contain punctuation.

str_subset(countries, "[[:punct:]]")
## [1] "Congo, Dem. Rep." "Congo, Rep."      "Cote d'Ivoire"    "Guinea-Bissau"   
## [5] "Hong Kong, China" "Korea, Dem. Rep." "Korea, Rep."      "Yemen, Rep."

Quantifiers

You can decorate characters (and other constructs, like metacharacters and classes) with information about how many characters they are allowed to match.

quantifier   meaning
*            0 or more
+            1 or more
?            0 or 1
{n}          exactly n
{n,}         at least n
{,m}         at most m
{n,m}        between n and m, inclusive

Explore these by inspecting matches for l followed by e, allowing for various numbers of characters in between.

l.*e will match strings with 0 or more characters in between, i.e. any string with an l eventually followed by an e. This is the most inclusive regex for this example, so we store the result as matches to use as a baseline for comparison.

(matches <- str_subset(fruit, pattern = "l.*e"))
##  [1] "apple"             "bell pepper"       "bilberry"         
##  [4] "blackberry"        "blood orange"      "blueberry"        
##  [7] "cantaloupe"        "chili pepper"      "clementine"       
## [10] "cloudberry"        "elderberry"        "huckleberry"      
## [13] "lemon"             "lime"              "lychee"           
## [16] "mulberry"          "olive"             "pineapple"        
## [19] "purple mangosteen" "salal berry"

Change the quantifier from * to + to require at least one intervening character. The strings that no longer match: all have a literal le with no preceding l and no following e.

list(match = intersect(matches, str_subset(fruit, pattern = "l.+e")),
     no_match = setdiff(matches, str_subset(fruit, pattern = "l.+e")))
## $match
##  [1] "bell pepper"       "bilberry"          "blackberry"       
##  [4] "blood orange"      "blueberry"         "cantaloupe"       
##  [7] "chili pepper"      "clementine"        "cloudberry"       
## [10] "elderberry"        "huckleberry"       "lime"             
## [13] "lychee"            "mulberry"          "olive"            
## [16] "purple mangosteen" "salal berry"      
## 
## $no_match
## [1] "apple"     "lemon"     "pineapple"

Change the quantifier from * to ? to require at most one intervening character. In the strings that no longer match, the shortest gap between l and following e is at least two characters.

list(match = intersect(matches, str_subset(fruit, pattern = "l.?e")),
     no_match = setdiff(matches, str_subset(fruit, pattern = "l.?e")))
## $match
##  [1] "apple"             "bilberry"          "blueberry"        
##  [4] "clementine"        "elderberry"        "huckleberry"      
##  [7] "lemon"             "mulberry"          "pineapple"        
## [10] "purple mangosteen"
## 
## $no_match
##  [1] "bell pepper"  "blackberry"   "blood orange" "cantaloupe"   "chili pepper"
##  [6] "cloudberry"   "lime"         "lychee"       "olive"        "salal berry"

Finally, we remove the quantifier and allow for no intervening characters. The strings that no longer match lack a literal le.

list(match = intersect(matches, str_subset(fruit, pattern = "le")),
     no_match = setdiff(matches, str_subset(fruit, pattern = "le")))
## $match
## [1] "apple"             "clementine"        "huckleberry"      
## [4] "lemon"             "pineapple"         "purple mangosteen"
## 
## $no_match
##  [1] "bell pepper"  "bilberry"     "blackberry"   "blood orange" "blueberry"   
##  [6] "cantaloupe"   "chili pepper" "cloudberry"   "elderberry"   "lime"        
## [11] "lychee"       "mulberry"     "olive"        "salal berry"

Escaping

You’ve probably caught on by now that there are certain characters with special meaning in regexes, including $ * + . ? [ ] ^ { } | ( ) \.

What if you really need the plus sign to be a literal plus sign and not a regex quantifier? You will need to escape it by prepending a backslash. But wait … there’s more! Before a regex is interpreted as a regular expression, it is also interpreted by R as a string. And backslash is used to escape there as well. So, in the end, you need to prepend two backslashes in order to match a literal plus sign in a regex.

This will be clearer with examples!

Escapes in plain old strings

Here is routine, non-regex use of backslash \ escapes in plain vanilla R strings. We intentionally use cat() instead of print() here.

  • To escape quotes inside quotes:

    cat("Do you use \"airquotes\" much?")
    ## Do you use "airquotes" much?

    Sidebar: eliminating the need for these escapes is exactly why people use double quotes inside single quotes and vice versa.

  • To insert newline (\n) or tab (\t):

    cat("before the newline\nafter the newline")
    ## before the newline
    ## after the newline
    cat("before the tab\tafter the tab")
    ## before the tab   after the tab

Escapes in regular expressions

Examples of using escapes in regexes to match characters that would otherwise have a special interpretation.

We know several gapminder country names contain a period. How do we isolate them? Although it’s tempting, this command str_subset(countries, pattern = ".") won’t work!

# Cheating using a POSIX class ;)
str_subset(countries, pattern = "[[:punct:]]")
## [1] "Congo, Dem. Rep." "Congo, Rep."      "Cote d'Ivoire"    "Guinea-Bissau"   
## [5] "Hong Kong, China" "Korea, Dem. Rep." "Korea, Rep."      "Yemen, Rep."
# Using two backslashes to escape the period
str_subset(countries, pattern = "\\.")
## [1] "Congo, Dem. Rep." "Congo, Rep."      "Korea, Dem. Rep." "Korea, Rep."     
## [5] "Yemen, Rep."

A last example that matches an actual square bracket.

(x <- c("whatever", "X is distributed U[0,1]"))
## [1] "whatever"                "X is distributed U[0,1]"
str_subset(x, pattern = "\\[")
## [1] "X is distributed U[0,1]"
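
And the literal plus sign promised earlier, sketched with base grepl() (str_detect() takes the same pattern); the vector x is made up for illustration:

```r
# "\\+" = escape once for the regex, once for the R string
x <- c("5 + 3", "53")
grepl("\\+", x)
## [1]  TRUE FALSE
```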

Groups and backreferences

Your first use of regex is likely to be simple matching: detecting or isolating strings that match a pattern.

But soon you will want to use regexes to transform the strings in character vectors. That means you need a way to address specific parts of the matching strings and to operate on them.

You can use parentheses inside regexes to define groups and you can refer to those groups later with backreferences.

For now, this lesson will refer you to other places to read up on this:

  • STAT 545 2014 Intro to regular expressions by TA Gloria Li (Appendix @ref(oldies)).
  • The [Strings chapter][r4ds-strings] of [R for Data Science][r4ds] [@wickham2016].

Combining Data

There are many ways to bring data together.

Bind - This is basically smashing rocks (well, tibbles) together. You can smash things together row-wise (“row binding”) or column-wise (“column binding”). Why do I characterize this as rock-smashing? They’re often fairly crude operations, with lots of responsibility falling on the analyst for making sure that the whole enterprise even makes sense.

When row binding, you need to consider the variables in the two tibbles. Do the same variables exist in each? Are they of the same type? Different approaches for row binding have different combinations of flexibility vs rigidity around these matters.

When column binding, the onus is entirely on the analyst to make sure that the rows are aligned. I would avoid column binding whenever possible. If you can introduce new variables through any other, safer means, do so! By safer, I mean: use a mechanism where the row alignment is correct by definition. A proper join is the gold standard. In addition to joins, functions like dplyr::mutate() and tidyr::separate() can be very useful for forcing yourself to work inside the constraint of a tibble.
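As a sketch of the “safer means” idea, on a made-up tibble: tidyr::separate() derives new columns from an existing one inside the tibble, so the row alignment cannot go wrong.

```r
library(tidyverse)

# A hypothetical two-part id: separate() splits it into two new columns,
# and each new value stays glued to the row it came from
df <- tibble(id = c("a-1", "b-2"))
df %>% separate(id, into = c("letter", "number"), sep = "-")
```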

Join - Here you designate a variable (or a combination of variables) as a key. A row in one data frame gets matched with a row in another data frame because they have the same key. You can then bring information from variables in a secondary data frame into a primary data frame based on this key-based lookup. That description is incredibly oversimplified, but that’s the basic idea.

A variety of row- and column-wise operations fit into this framework, which implies there are many different flavors of join. The concepts and vocabulary around joins come from the database world. The relevant functions in dplyr follow this convention and all mention join. The most relevant base R function is merge().

Let’s explore each type of operation with a few examples.

Bind

Row binding

We used word count data from the Lord of the Rings trilogy to explore the concept of tidy data. That kicked off with a quiet, successful row bind. Let’s revisit that.

Here’s what a perfect row bind of three (untidy!) data frames looks like.

fship <- tribble(
                         ~Film,    ~Race, ~Female, ~Male,
  "The Fellowship Of The Ring",    "Elf",    1229,   971,
  "The Fellowship Of The Ring", "Hobbit",      14,  3644,
  "The Fellowship Of The Ring",    "Man",       0,  1995
)
rking <- tribble(
                         ~Film,    ~Race, ~Female, ~Male,
      "The Return Of The King",    "Elf",     183,   510,
      "The Return Of The King", "Hobbit",       2,  2673,
      "The Return Of The King",    "Man",     268,  2459
)
ttow <- tribble(
                         ~Film,    ~Race, ~Female, ~Male,
              "The Two Towers",    "Elf",     331,   513,
              "The Two Towers", "Hobbit",       0,  2463,
              "The Two Towers",    "Man",     401,  3589
)
(lotr_untidy <- bind_rows(fship, ttow, rking))
## # A tibble: 9 x 4
##   Film                       Race   Female  Male
##   <chr>                      <chr>   <dbl> <dbl>
## 1 The Fellowship Of The Ring Elf      1229   971
## 2 The Fellowship Of The Ring Hobbit     14  3644
## 3 The Fellowship Of The Ring Man         0  1995
## 4 The Two Towers             Elf       331   513
## 5 The Two Towers             Hobbit      0  2463
## 6 The Two Towers             Man       401  3589
## 7 The Return Of The King     Elf       183   510
## 8 The Return Of The King     Hobbit      2  2673
## 9 The Return Of The King     Man       268  2459

dplyr::bind_rows() works like a charm with these very row-bindable data frames! So does base rbind() (try it!).

But what if one of the data frames is somehow missing a variable? Let’s mangle one and find out.

ttow_no_Female <- ttow %>% mutate(Female = NULL)
bind_rows(fship, ttow_no_Female, rking)
## # A tibble: 9 x 4
##   Film                       Race   Female  Male
##   <chr>                      <chr>   <dbl> <dbl>
## 1 The Fellowship Of The Ring Elf      1229   971
## 2 The Fellowship Of The Ring Hobbit     14  3644
## 3 The Fellowship Of The Ring Man         0  1995
## 4 The Two Towers             Elf        NA   513
## 5 The Two Towers             Hobbit     NA  2463
## 6 The Two Towers             Man        NA  3589
## 7 The Return Of The King     Elf       183   510
## 8 The Return Of The King     Hobbit      2  2673
## 9 The Return Of The King     Man       268  2459
rbind(fship, ttow_no_Female, rking)
## Error in rbind(deparse.level, ...): numbers of columns of arguments do not match

We see that dplyr::bind_rows() does the row bind and puts NA in for the missing values caused by the lack of Female data from The Two Towers. Base rbind() refuses to row bind in this situation.

I invite you to experiment with other realistic, challenging scenarios, e.g.:

  • Change the order of variables. Does row binding match variables by name or position?
  • Row bind data frames where the variable x is of one type in one data frame and another type in the other. Try combinations that you think should work and some that should not. What actually happens?
  • Row bind data frames in which the factor x has different levels in one data frame and different levels in the other. What happens?
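The first experiment might be sketched like this; bind_rows() should match columns by name, not position, so reordering the variables should be harmless (verify for yourself!).

```r
# Reorder the variables in one input, then row bind as before
fship_reordered <- fship %>% select(Male, Female, Race, Film)
bind_rows(fship_reordered, rking)
```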

In conclusion, row binding usually works when it should (especially with dplyr::bind_rows()) and usually doesn’t when it shouldn’t. The biggest risk is being aggravated.

Column binding

Column binding is much more dangerous because it often “works” when it should not. It’s your job to make sure the rows are aligned and it’s all too easy to screw this up.

The data in gapminder was originally excavated from 3 messy Excel spreadsheets: one each for life expectancy, population, and GDP per capita. Let’s relive some of the data wrangling joy and show a column bind gone wrong.

I create 3 separate data frames, do some evil row sorting, then column bind. There are no errors. The result gapminder_garbage sort of looks OK. Univariate summary statistics and exploratory plots will look OK. But I’ve created complete nonsense!

library(gapminder)

life_exp <- gapminder %>%
  select(country, year, lifeExp)

pop <- gapminder %>%
  arrange(year) %>% 
  select(pop)
  
gdp_percap <- gapminder %>% 
  arrange(pop) %>% 
  select(gdpPercap)

(gapminder_garbage <- bind_cols(life_exp, pop, gdp_percap))
## # A tibble: 1,704 x 5
##    country      year lifeExp      pop gdpPercap
##    <fct>       <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan  1952    28.8  8425333      880.
##  2 Afghanistan  1957    30.3  1282697      861.
##  3 Afghanistan  1962    32.0  9279525     2670.
##  4 Afghanistan  1967    34.0  4232095     1072.
##  5 Afghanistan  1972    36.1 17876956     1385.
##  6 Afghanistan  1977    38.4  8691212     2865.
##  7 Afghanistan  1982    39.9  6927772     1533.
##  8 Afghanistan  1987    40.8   120447     1738.
##  9 Afghanistan  1992    41.7 46886859     3021.
## 10 Afghanistan  1997    41.8  8730405     1890.
## # … with 1,694 more rows
summary(gapminder$lifeExp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   23.60   48.20   60.71   59.47   70.85   82.60
summary(gapminder_garbage$lifeExp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   23.60   48.20   60.71   59.47   70.85   82.60
range(gapminder$gdpPercap)
## [1]    241.1659 113523.1329
range(gapminder_garbage$gdpPercap)
## [1]    241.1659 113523.1329

One last cautionary tale about column binding. This one requires the use of cbind() and it’s why the tidyverse is generally unwilling to recycle when combining things of different length.

I create a tibble with most of the gapminder columns. I create another with the remainder, but filtered down to just one country. I am able to cbind() these objects! Why? Because the 12 rows for Canada divide evenly into the 1704 rows of gapminder. Note that dplyr::bind_cols() refuses to column bind here.

gapminder_mostly <- gapminder %>% select(-pop, -gdpPercap)
gapminder_leftovers_filtered <- gapminder %>% 
  filter(country == "Canada") %>% 
  select(pop, gdpPercap)

gapminder_nonsense <- cbind(gapminder_mostly, gapminder_leftovers_filtered)
head(gapminder_nonsense, 14)
##        country continent year lifeExp      pop gdpPercap
## 1  Afghanistan      Asia 1952  28.801 14785584  11367.16
## 2  Afghanistan      Asia 1957  30.332 17010154  12489.95
## 3  Afghanistan      Asia 1962  31.997 18985849  13462.49
## 4  Afghanistan      Asia 1967  34.020 20819767  16076.59
## 5  Afghanistan      Asia 1972  36.088 22284500  18970.57
## 6  Afghanistan      Asia 1977  38.438 23796400  22090.88
## 7  Afghanistan      Asia 1982  39.854 25201900  22898.79
## 8  Afghanistan      Asia 1987  40.822 26549700  26626.52
## 9  Afghanistan      Asia 1992  41.674 28523502  26342.88
## 10 Afghanistan      Asia 1997  41.763 30305843  28954.93
## 11 Afghanistan      Asia 2002  42.129 31902268  33328.97
## 12 Afghanistan      Asia 2007  43.828 33390141  36319.24
## 13     Albania    Europe 1952  55.230 14785584  11367.16
## 14     Albania    Europe 1957  59.280 17010154  12489.95

This data frame isn’t obviously wrong, but it is wrong. See how Canada’s population and GDP per capita repeat for each country?

Bottom line: Row bind when you need to, but inspect the results re: coercion. Column bind only if you must and be extremely paranoid.

Joins in dplyr

Visit Chapter @ref(join-cheatsheet) to see concrete examples of all the joins implemented in dplyr, based on comic characters and publishers.

The most recent release of gapminder includes a new data frame, country_codes, with country names and ISO codes. Therefore you can also use it to practice joins.

gapminder %>% 
  select(country, continent) %>% 
  group_by(country) %>% 
  slice(1) %>% 
  left_join(country_codes)
## Joining, by = "country"
## Warning: Column `country` joining factor and character vector, coercing into
## character vector
## # A tibble: 142 x 4
## # Groups:   country [142]
##    country     continent iso_alpha iso_num
##    <chr>       <fct>     <chr>       <int>
##  1 Afghanistan Asia      AFG             4
##  2 Albania     Europe    ALB             8
##  3 Algeria     Africa    DZA            12
##  4 Angola      Africa    AGO            24
##  5 Argentina   Americas  ARG            32
##  6 Australia   Oceania   AUS            36
##  7 Austria     Europe    AUT            40
##  8 Bahrain     Asia      BHR            48
##  9 Bangladesh  Asia      BGD            50
## 10 Belgium     Europe    BEL            56
## # … with 132 more rows

Joining

Join (a.k.a. merge) two tables: dplyr join cheatsheet with comic characters and publishers.

Other great places to read about joins:

  • The dplyr vignette on [Two-table verbs][dplyr-vignette-two-table].
  • The [Relational data chapter][r4ds-relational-data] in [R for Data Science][r4ds] [@wickham2016]. Excellent diagrams.

The data

Working with two small data frames: superheroes and publishers.

# dplyr provides the join functions

library(tidyverse)

superheroes <- tibble::tribble(
       ~name, ~alignment,  ~gender,          ~publisher,
   "Magneto",      "bad",   "male",            "Marvel",
     "Storm",     "good", "female",            "Marvel",
  "Mystique",      "bad", "female",            "Marvel",
    "Batman",     "good",   "male",                "DC",
     "Joker",      "bad",   "male",                "DC",
  "Catwoman",      "bad", "female",                "DC",
   "Hellboy",     "good",   "male", "Dark Horse Comics"
  )

publishers <- tibble::tribble(
  ~publisher, ~yr_founded,
        "DC",       1934L,
    "Marvel",       1939L,
     "Image",       1992L
  )

Sorry, the cheat sheet does not illustrate “multiple match” situations terribly well.

Sub-plot: watch the row and variable order of the join results for a healthy reminder of why it’s dangerous to rely on any of that in an analysis.

inner_join(superheroes, publishers)

inner_join(x, y): Return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned. This is a mutating join.

(ijsp <- inner_join(superheroes, publishers))
## Joining, by = "publisher"
## # A tibble: 6 x 5
##   name     alignment gender publisher yr_founded
##   <chr>    <chr>     <chr>  <chr>          <int>
## 1 Magneto  bad       male   Marvel          1939
## 2 Storm    good      female Marvel          1939
## 3 Mystique bad       female Marvel          1939
## 4 Batman   good      male   DC              1934
## 5 Joker    bad       male   DC              1934
## 6 Catwoman bad       female DC              1934

We lose Hellboy in the join because, although he appears in x = superheroes, his publisher Dark Horse Comics does not appear in y = publishers. The join result has all variables from x = superheroes plus yr_founded, from y.


semi_join(superheroes, publishers)

semi_join(x, y): Return all rows from x where there are matching values in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x. This is a filtering join.

(sjsp <- semi_join(superheroes, publishers))
## Joining, by = "publisher"
## # A tibble: 6 x 4
##   name     alignment gender publisher
##   <chr>    <chr>     <chr>  <chr>    
## 1 Magneto  bad       male   Marvel   
## 2 Storm    good      female Marvel   
## 3 Mystique bad       female Marvel   
## 4 Batman   good      male   DC       
## 5 Joker    bad       male   DC       
## 6 Catwoman bad       female DC

We get a similar result as with inner_join() but the join result contains only the variables originally found in x = superheroes.


left_join(superheroes, publishers)

left_join(x, y): Return all rows from x, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned. This is a mutating join.

(ljsp <- left_join(superheroes, publishers))
## Joining, by = "publisher"
## # A tibble: 7 x 5
##   name     alignment gender publisher         yr_founded
##   <chr>    <chr>     <chr>  <chr>                  <int>
## 1 Magneto  bad       male   Marvel                  1939
## 2 Storm    good      female Marvel                  1939
## 3 Mystique bad       female Marvel                  1939
## 4 Batman   good      male   DC                      1934
## 5 Joker    bad       male   DC                      1934
## 6 Catwoman bad       female DC                      1934
## 7 Hellboy  good      male   Dark Horse Comics         NA

We basically get x = superheroes back, but with the addition of variable yr_founded, which is unique to y = publishers. Hellboy, whose publisher does not appear in y = publishers, has an NA for yr_founded.


anti_join(superheroes, publishers)

anti_join(x, y): Return all rows from x where there are not matching values in y, keeping just columns from x. This is a filtering join.

(ajsp <- anti_join(superheroes, publishers))
## Joining, by = "publisher"
## # A tibble: 1 x 4
##   name    alignment gender publisher        
##   <chr>   <chr>     <chr>  <chr>            
## 1 Hellboy good      male   Dark Horse Comics

We keep only Hellboy now (and do not get yr_founded).


inner_join(publishers, superheroes)

inner_join(x, y): Return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned. This is a mutating join.

(ijps <- inner_join(publishers, superheroes))
## Joining, by = "publisher"
## # A tibble: 6 x 5
##   publisher yr_founded name     alignment gender
##   <chr>          <int> <chr>    <chr>     <chr> 
## 1 DC              1934 Batman   good      male  
## 2 DC              1934 Joker    bad       male  
## 3 DC              1934 Catwoman bad       female
## 4 Marvel          1939 Magneto  bad       male  
## 5 Marvel          1939 Storm    good      female
## 6 Marvel          1939 Mystique bad       female

In a way, this does illustrate multiple matches, if you think about it from the x = publishers direction. Every publisher that has a match in y = superheroes appears multiple times in the result, once for each match. In fact, we’re getting the same result as with inner_join(superheroes, publishers), up to variable order (which you should also never rely on in an analysis).


semi_join(publishers, superheroes)

semi_join(x, y): Return all rows from x where there are matching values in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x. This is a filtering join.

(sjps <- semi_join(x = publishers, y = superheroes))
## Joining, by = "publisher"
## # A tibble: 2 x 2
##   publisher yr_founded
##   <chr>          <int>
## 1 DC              1934
## 2 Marvel          1939

Now the effect of switching the x and y roles is clearer. The result resembles x = publishers, but the publisher Image is lost, because there are no observations where publisher == "Image" in y = superheroes.


left_join(publishers, superheroes)

left_join(x, y): Return all rows from x, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned. This is a mutating join.

(ljps <- left_join(publishers, superheroes))
## Joining, by = "publisher"
## # A tibble: 7 x 5
##   publisher yr_founded name     alignment gender
##   <chr>          <int> <chr>    <chr>     <chr> 
## 1 DC              1934 Batman   good      male  
## 2 DC              1934 Joker    bad       male  
## 3 DC              1934 Catwoman bad       female
## 4 Marvel          1939 Magneto  bad       male  
## 5 Marvel          1939 Storm    good      female
## 6 Marvel          1939 Mystique bad       female
## 7 Image           1992 <NA>     <NA>      <NA>

We get a similar result as with inner_join() but the publisher Image survives in the join, even though no superheroes from Image appear in y = superheroes. As a result, Image has NAs for name, alignment, and gender.


anti_join(publishers, superheroes)

anti_join(x, y): Return all rows from x where there are not matching values in y, keeping just columns from x. This is a filtering join.

(ajps <- anti_join(publishers, superheroes))
## Joining, by = "publisher"
## # A tibble: 1 x 2
##   publisher yr_founded
##   <chr>          <int>
## 1 Image           1992

We keep only publisher Image now (and the variables found in x = publishers).


full_join(superheroes, publishers)

full_join(x, y): Return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing. This is a mutating join.

(fjsp <- full_join(superheroes, publishers))
## Joining, by = "publisher"
## # A tibble: 8 x 5
##   name     alignment gender publisher         yr_founded
##   <chr>    <chr>     <chr>  <chr>                  <int>
## 1 Magneto  bad       male   Marvel                  1939
## 2 Storm    good      female Marvel                  1939
## 3 Mystique bad       female Marvel                  1939
## 4 Batman   good      male   DC                      1934
## 5 Joker    bad       male   DC                      1934
## 6 Catwoman bad       female DC                      1934
## 7 Hellboy  good      male   Dark Horse Comics         NA
## 8 <NA>     <NA>      <NA>   Image                   1992

We get all rows of x = superheroes plus a new row from y = publishers, containing the publisher Image. We get all variables from x = superheroes AND all variables from y = publishers. Any row that derives solely from one table or the other carries NAs in the variables found only in the other table.


join data frames

“Joining” or “merging” two different datasets is tricky stuff. Let’s go through some more examples while reviewing the basic concepts. In the flights data frame, the variable carrier lists the carrier code for the different flights. While the corresponding airline names for "UA" and "AA" might be somewhat easy to guess (United and American Airlines), what airlines have codes "VX", "HA", and "B6"? This information is provided in a separate data frame airlines.

View(airlines)

We see that in airlines, carrier is the carrier code, while name is the full name of the airline company. Using this table, we can see that "VX", "HA", and "B6" correspond to Virgin America, Hawaiian Airlines, and JetBlue, respectively. However, wouldn’t it be nice to have all this information in a single data frame instead of two separate data frames? We can do this by “joining” the flights and airlines data frames.

Note that the values in the variable carrier in the flights data frame match the values in the variable carrier in the airlines data frame. In this case, we can use the variable carrier as a key variable to match the rows of the two data frames. Key variables are almost always identification variables that uniquely identify the observational units as we saw in Subsection @ref(identification-vs-measurement-variables). This ensures that rows in both data frames are appropriately matched during the join. Hadley and Garrett [@rds2016] created the diagram shown in Figure @ref(fig:reldiagram) to help us understand how the different data frames in the nycflights13 package are linked by various key variables:

(ref:relationships-nycflights13) Data relationships in nycflights13 from R for Data Science.

Relationships among nycflights tables

Matching “key” variable names

In both the flights and airlines data frames, the key variable we want to join/merge/match the rows by has the same name: carrier. Let’s use the inner_join() function to join the two data frames, where the rows will be matched by the variable carrier, and then compare the resulting data frames:

flights_joined <- flights %>% 
  inner_join(airlines, by = "carrier")
View(flights)
View(flights_joined)

Observe that the flights and flights_joined data frames are identical except that flights_joined has an additional variable name. The values of name correspond to the airline companies’ names as indicated in the airlines data frame.

A visual representation of the inner_join() is shown in Figure @ref(fig:ijdiagram) [@rds2016]. There are other types of joins available (such as left_join(), right_join(), full_join(), and anti_join()), but the inner_join() will solve nearly all of the problems you’ll encounter in this book.
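A filtering join makes a nice sanity check here (a sketch; with these data, every carrier in flights should have a match): anti_join() returns any flights rows whose key is missing from airlines, so an empty result confirms the inner join lost no rows.

```r
# Any flights whose carrier code is absent from airlines?
# Zero rows means an inner join on carrier drops no flights.
flights %>% anti_join(airlines, by = "carrier")
```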

(ref:inner-join-r4ds) Diagram of inner join from R for Data Science.

Inner join.

Different “key” variable names

Say instead you are interested in the destinations of all domestic flights departing NYC in 2013, and you ask yourself questions like: “What cities are these airports in?”, or “Is "ORD" Orlando?”, or “Where is "FLL"?”.

The airports data frame contains the airport codes for each airport:

View(airports)

However, if you look at both the airports and flights data frames, you’ll find that the airport codes are in variables that have different names. In airports the airport code is in faa, whereas in flights the airport codes are in origin and dest. This fact is further highlighted in the visual representation of the relationships between these data frames in Figure @ref(fig:reldiagram).

In order to join these two data frames by airport code, our inner_join() operation will use the by = c("dest" = "faa") argument with modified code syntax allowing us to join two data frames where the key variable has a different name:

flights_with_airport_names <- flights %>% 
  inner_join(airports, by = c("dest" = "faa"))
View(flights_with_airport_names)

Let’s construct the chain of pipe operators %>% that computes the number of flights from NYC to each destination, but also includes information about each destination airport:

named_dests <- flights %>%
  group_by(dest) %>%
  summarize(num_flights = n()) %>%
  arrange(desc(num_flights)) %>%
  inner_join(airports, by = c("dest" = "faa")) %>%
  rename(airport_name = name)
named_dests
## # A tibble: 101 x 9
##    dest  num_flights airport_name          lat    lon   alt    tz dst   tzone   
##    <chr>       <int> <chr>               <dbl>  <dbl> <dbl> <dbl> <chr> <chr>   
##  1 ORD         17283 Chicago Ohare Intl   42.0  -87.9   668    -6 A     America…
##  2 ATL         17215 Hartsfield Jackson…  33.6  -84.4  1026    -5 A     America…
##  3 LAX         16174 Los Angeles Intl     33.9 -118.    126    -8 A     America…
##  4 BOS         15508 General Edward Law…  42.4  -71.0    19    -5 A     America…
##  5 MCO         14082 Orlando Intl         28.4  -81.3    96    -5 A     America…
##  6 CLT         14064 Charlotte Douglas …  35.2  -80.9   748    -5 A     America…
##  7 SFO         13331 San Francisco Intl   37.6 -122.     13    -8 A     America…
##  8 FLL         12055 Fort Lauderdale Ho…  26.1  -80.2     9    -5 A     America…
##  9 MIA         11728 Miami Intl           25.8  -80.3     8    -5 A     America…
## 10 DCA          9705 Ronald Reagan Wash…  38.9  -77.0    15    -5 A     America…
## # … with 91 more rows

In case you didn’t know, "ORD" is the airport code of Chicago O’Hare airport and "FLL" is the main airport in Fort Lauderdale, Florida, which can be seen in the airport_name variable.

Multiple “key” variables

Say instead we want to join two data frames by multiple key variables. For example, in Figure @ref(fig:reldiagram), we see that in order to join the flights and weather data frames, we need more than one key variable: year, month, day, hour, and origin. This is because the combination of these 5 variables act to uniquely identify each observational unit in the weather data frame: hourly weather recordings at each of the 3 NYC airports.

We achieve this by specifying a vector of key variables to join by using the c() function. Recall from Subsection @ref(programming-concepts) that c() is short for “combine” or “concatenate.”

flights_weather_joined <- flights %>%
  inner_join(weather, by = c("year", "month", "day", "hour", "origin"))
View(flights_weather_joined)

Normal forms

The data frames included in the nycflights13 package are in a form that minimizes redundancy of data. For example, the flights data frame only saves the carrier code of the airline company; it does not include the actual name of the airline. The first row of flights has carrier equal to UA, but it does not include the airline name of “United Air Lines Inc.”

The names of the airline companies are included in the name variable of the airlines data frame. In order to have the airline company name included in flights, we could join these two data frames as follows:

joined_flights <- flights %>% 
  inner_join(airlines, by = "carrier")
View(joined_flights)

We are capable of performing this join because each of the data frames have keys in common to relate one to another: the carrier variable in both the flights and airlines data frames. The key variable(s) that we base our joins on are often identification variables as we mentioned previously.

This is an important property of what’s known as normal forms of data. The process of decomposing data frames into less redundant tables without losing information is called normalization. More information is available on Wikipedia.
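You can glimpse the redundancy that normalization avoids by tabulating the joined data frame: every carrier/name pair is repeated once per flight, whereas the normalized airlines table stores each name exactly once.

```r
# Count how many times each airline name is repeated in the joined table
joined_flights %>% count(carrier, name)
```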

Both dplyr and the SQL we mentioned in the introduction of this chapter use such normal forms. Given that they share such commonalities, once you learn either of these two tools, you can learn the other very easily.

Other Verbs

Here are some other useful data wrangling verbs:

  • select() only a subset of variables/columns.
  • rename() variables/columns to have new names.
  • Return only the top_n() values of a variable.
  • slice() and pull() specific rows and columns.

select variables

Diagram of select() columns.


We’ve seen that the flights data frame in the nycflights13 package contains 19 different variables. You can identify the names of these 19 variables by running the glimpse() function from the dplyr package:

glimpse(flights)

However, say you only need two of these 19 variables: carrier and flight. You can select() these two variables:

flights %>% 
  select(carrier, flight)

This function makes it easier to explore large datasets since it allows us to limit the scope to only those variables we care most about. For example, if we select() only a smaller number of variables as is shown in Figure @ref(fig:selectfig), it will make viewing the dataset in RStudio’s spreadsheet viewer more digestible.

Let’s say instead you want to drop, or de-select, certain variables. For example, consider the variable year in the flights data frame. This variable isn’t quite a “variable” because it is always 2013 and hence doesn’t change. Say you want to remove this variable from the data frame. We can deselect year by using the - sign:

flights_no_year <- flights %>% select(-year)

Another way of selecting columns/variables is by specifying a range of columns:

flight_arr_times <- flights %>% select(month:day, arr_time:sched_arr_time)
flight_arr_times

This will select() all columns between month and day, as well as between arr_time and sched_arr_time, and drop the rest.

The select() function can also be used to reorder columns when used with the everything() helper function. For example, suppose we want the hour, minute, and time_hour variables to appear immediately after the year, month, and day variables, while not discarding the rest of the variables. In the following code, everything() will pick up all remaining variables:

flights_reorder <- flights %>% 
  select(year, month, day, hour, minute, time_hour, everything())
glimpse(flights_reorder)

Lastly, the helper functions starts_with(), ends_with(), and contains() can be used to select variables/columns that match those conditions. As examples,

flights %>% select(starts_with("a"))
flights %>% select(ends_with("delay"))
flights %>% select(contains("time"))

rename variables

Another useful function is rename(), which, as you may have guessed, changes the names of variables. Suppose we want to focus only on dep_time and arr_time and rename them departure_time and arrival_time in a new data frame, flights_time_new:

flights_time_new <- flights %>% 
  select(dep_time, arr_time) %>% 
  rename(departure_time = dep_time, arrival_time = arr_time)
glimpse(flights_time_new)

Note that in this case we used a single = sign within the rename(). For example, departure_time = dep_time renames the dep_time variable to have the new name departure_time. This is because we are not testing for equality like we would using ==. Instead we want to assign a new variable departure_time to have the same values as dep_time and then delete the variable dep_time. Note that new dplyr users often forget that the new variable name comes before the equal sign.
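
As a quick check on this ordering, consider the following sketch. The second call is shown commented out because it would produce an error:

```r
# The new name goes on the LEFT of = and the old name on the RIGHT:
library(dplyr)
library(nycflights13)

flights %>% rename(departure_time = dep_time)    # renames dep_time
# flights %>% rename(dep_time = departure_time)  # error: there is no
#                                                # column departure_time
```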

top_n values of a variable

We can also return the top n values of a variable using the top_n() function. For example, we can return a data frame of the top 10 destination airports using the example from Subsection @ref(diff-key). Observe that we set the number of values to return to n = 10 and wt = num_flights to indicate that we want the rows corresponding to the top 10 values of num_flights. See the help file for top_n() by running ?top_n for more information.

named_dests %>% top_n(n = 10, wt = num_flights)

Let’s further arrange() these results in descending order of num_flights:

named_dests  %>% 
  top_n(n = 10, wt = num_flights) %>% 
  arrange(desc(num_flights))

slice and pull and []

Using slice() gives us specific rows from the flights tibble:

slice(flights, 2:5)
## # A tibble: 4 x 22
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      533            529         4      850            830
## 2  2013     1     1      542            540         2      923            850
## 3  2013     1     1      544            545        -1     1004           1022
## 4  2013     1     1      554            600        -6      812            837
## # … with 14 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>, gain <dbl>, hours <dbl>,
## #   gain_per_hour <dbl>

pull() grabs out a variable as a vector, rather than leaving it within a tibble, as select() does:

slice(flights, 2:5) %>% 
  pull(dep_time)
## [1] 533 542 544 554

This is often handy when you want to feed the data into a function, like mean() which requires a vector as input:

slice(flights, 2:5) %>% 
  pull(dep_time) %>% 
  mean()
## [1] 543.25

The most common way to subset vectors is to use the “bracket” operator []. Example:

flights$dep_time[2:5]
## [1] 533 542 544 554

Conclusion

Summary table

Let’s recap our data wrangling verbs in Table @ref(tab:wrangle-summary-table). Using these verbs and the pipe %>% operator from Section @ref(piping), you’ll be able to write easily legible code to perform almost all the data wrangling and data transformation necessary for the rest of this book.

Summary of data wrangling verbs
Verb Data wrangling operation
filter() Pick out a subset of rows
summarize() Summarize many values to one using a summary statistic function like mean(), median(), etc.
group_by() Add grouping structure to rows in data frame. Note this does not change values in data frame, rather only the meta-data
mutate() Create new variables by mutating existing ones
arrange() Arrange rows of a data frame in ascending (default) or descending order of one of its variables
inner_join() Join/merge two data frames, matching rows by a key variable
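
The verbs in this table compose naturally into pipelines. As a capstone, the following sketch (assuming dplyr and nycflights13 are loaded) computes the average departure delay for each origin airport:

```r
library(dplyr)
library(nycflights13)

flights %>%
  filter(!is.na(dep_delay)) %>%              # drop flights with missing delays
  group_by(origin) %>%                       # one group per origin airport
  summarize(avg_delay = mean(dep_delay)) %>% # one summary row per group
  arrange(desc(avg_delay))                   # most-delayed airport first
```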

Additional resources

If you want to further unlock the power of the dplyr package for data wrangling, we suggest that you check out RStudio’s “Data Transformation with dplyr” cheatsheet. This cheatsheet summarizes much more than what we’ve discussed in this chapter, in particular more intermediate level and advanced data wrangling functions, while providing quick and easy-to-read visual descriptions. In fact, many of the diagrams illustrating data wrangling operations in this chapter, such as Figure @ref(fig:filter) on filter(), originate from this cheatsheet.

In the current version of RStudio in late 2019, you can access this cheatsheet by going to the RStudio Menu Bar -> Help -> Cheatsheets -> “Data Transformation with dplyr.” You can see a preview in the figure below.

Data Transformation with dplyr cheatsheet.


On top of the data wrangling verbs and examples we presented in this section, if you’d like to see more examples of using the dplyr package for data wrangling, check out Chapter 5 of R for Data Science [@rds2016].

Tidy

In Subsection @ref(programming-concepts), we introduced the concept of a data frame in R: a rectangular spreadsheet-like representation of data where the rows correspond to observations and the columns correspond to variables describing each observation. In Section @ref(nycflights13), we started exploring our first data frame: the flights data frame included in the nycflights13 package. In Chapter @ref(viz), we created visualizations based on the data included in flights and other data frames such as weather. In Chapter @ref(wrangling), we learned how to take existing data frames and transform/modify them to suit our ends.

Let’s extend some of these ideas by discussing a type of data formatting called “tidy” data. You will see that having data stored in “tidy” format is about more than just what the everyday definition of the term “tidy” might suggest: having your data “neatly organized.” Instead, we define the term “tidy” as it’s used by data scientists who use R, outlining a set of rules by which data is saved.

Knowledge of this type of data formatting was not necessary for our treatment of data visualization in Chapter @ref(viz) and data wrangling in Chapter @ref(wrangling). This is because all the data used were already in “tidy” format. In this chapter, we’ll now see that this format is essential to using the tools we covered up until now. Furthermore, it will also be useful for all subsequent chapters in this book when we cover regression and statistical inference. First, however, we’ll show you how to import spreadsheet data in R.

Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). If needed, read Section @ref(packages) for information on how to install and load R packages.

library(dplyr)
library(ggplot2)
library(readr)
library(tidyr)
library(nycflights13)
library(fivethirtyeight)

Importing data

Up to this point, we’ve almost entirely used data stored inside of an R package. Say instead you have your own data saved on your computer or somewhere online. How can you analyze this data in R? Spreadsheet data is often saved in one of the following three formats:

First, a Comma Separated Values .csv file. You can think of a .csv file as a bare-bones spreadsheet where:

  • Each line in the file corresponds to one row of data/one observation.
  • Values for each line are separated with commas. In other words, the values of different variables are separated by commas in each row.
  • The first line is often, but not always, a header row indicating the names of the columns/variables.
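
For instance, here is a minimal, hypothetical .csv file illustrating these rules, read directly from a string. (In recent versions of readr, literal data must be wrapped in I(); this is a sketch, not data used elsewhere in the book.)

```r
library(readr)

# A hypothetical three-line .csv: a header row plus two observations
csv_text <- "name,age,city
Alice,34,Boston
Bob,28,Chicago"

read_csv(I(csv_text))  # I() marks the string as literal data, not a file path
```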

Second, an Excel .xlsx spreadsheet file. This format is based on Microsoft’s proprietary Excel software. As opposed to bare-bones .csv files, .xlsx Excel files contain a lot of meta-data (data about data). Recall we saw a previous example of meta-data in Section @ref(groupby) when adding “group structure” meta-data to a data frame by using the group_by() verb. Some examples of Excel spreadsheet meta-data include the use of bold and italic fonts, colored cells, different column widths, and formula macros.

Third, a Google Sheets file, which is a “cloud” or online-based way to work with a spreadsheet. Google Sheets allows you to download your data in both comma separated values .csv and Excel .xlsx formats. To import Google Sheets data in R use the googlesheets4 package.

Let’s import a Comma Separated Values .csv file that exists on the internet. The .csv file dem_score.csv contains ratings of the level of democracy in different countries spanning 1952 to 1992 and is accessible at https://moderndive.com/data/dem_score.csv. Let’s use the read_csv() function from the readr [@R-readr] package to read it off the web, import it into R, and save it in a data frame called dem_score.

library(readr)
dem_score <- read_csv("https://moderndive.com/data/dem_score.csv")
dem_score
## # A tibble: 96 x 10
##    country    `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992`
##    <chr>       <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 Albania        -9     -9     -9     -9     -9     -9     -9     -9      5
##  2 Argentina      -9     -1     -1     -9     -9     -9     -8      8      7
##  3 Armenia        -9     -7     -7     -7     -7     -7     -7     -7      7
##  4 Australia      10     10     10     10     10     10     10     10     10
##  5 Austria        10     10     10     10     10     10     10     10     10
##  6 Azerbaijan     -9     -7     -7     -7     -7     -7     -7     -7      1
##  7 Belarus        -9     -7     -7     -7     -7     -7     -7     -7      7
##  8 Belgium        10     10     10     10     10     10     10     10     10
##  9 Bhutan        -10    -10    -10    -10    -10    -10    -10    -10    -10
## 10 Bolivia        -4     -3     -3     -4     -7     -7      8      9      9
## # … with 86 more rows

In this dem_score data frame, the minimum value of -10 corresponds to a highly autocratic nation, whereas a value of 10 corresponds to a highly democratic nation. Note also that backticks surround the different variable names. Variable names in R by default are not allowed to start with a number nor include spaces, but we can get around this fact by surrounding the column name with backticks. We’ll revisit the dem_score data frame in a case study in the upcoming Section @ref(case-study-tidy).
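
For example, to refer to such columns in a pipeline, wrap their names in backticks. A quick sketch, assuming dem_score has been imported as above and dplyr is loaded:

```r
library(dplyr)

# Column names that start with a number must be backtick-quoted:
dem_score %>%
  select(country, `1952`, `1992`)
```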

Note that the read_csv() function included in the readr package is different from the read.csv() function that comes installed with R. While the difference in the names might seem trivial (an _ instead of a .), the read_csv() function is, in our opinion, easier to use since it can more easily read data off the web and generally imports data at a much faster speed. Furthermore, the read_csv() function included in readr saves data frames as tibbles by default.
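
The difference is visible in the class of the returned object. A sketch (both calls read the same file off the web):

```r
library(readr)

url <- "https://moderndive.com/data/dem_score.csv"
class(read.csv(url))   # "data.frame" (base R)
class(read_csv(url))   # includes "tbl_df": a tibble, which prints compactly
```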

Web scraping

The data we need to answer a question is not always in a spreadsheet ready for us to read. For example, the US murders dataset we used in the R Basics chapter originally comes from this Wikipedia page:

url <- paste0("https://en.wikipedia.org/w/index.php?title=",
              "Gun_violence_in_the_United_States_by_state",
              "&direction=prev&oldid=810166167")

You can see the data table when you visit the webpage:

To get this data, we need to do some web scraping.

Web scraping, or web harvesting, is the term we use to describe the process of extracting data from a website. The reason we can do this is that the information a browser uses to render a webpage is received as a text file from a server. The text is code written in Hypertext Markup Language (HTML). Every browser has a way to show the HTML source code for a page, though the exact method differs by browser. In Chrome, you can use Control-U on a PC and command+alt+U on a Mac. You will see something like this:

HTML

Because this code is accessible, we can download the HTML file, import it into R, and then write programs to extract the information we need from the page. However, once we look at HTML code, this might seem like a daunting task. But we will show you some convenient tools to facilitate the process. To get an idea of how it works, here are a few lines of code from the Wikipedia page that provides the US murders data:

<table class="wikitable sortable">
<tr>
<th>State</th>
<th><a href="/wiki/List_of_U.S._states_and_territories_by_population" 
title="List of U.S. states and territories by population">Population</a><br />
<small>(total inhabitants)</small><br />
<small>(2015)</small> <sup id="cite_ref-1" class="reference">
<a href="#cite_note-1">[1]</a></sup></th>
<th>Murders and Nonnegligent
<p>Manslaughter<br />
<small>(total deaths)</small><br />
<small>(2015)</small> <sup id="cite_ref-2" class="reference">
<a href="#cite_note-2">[2]</a></sup></p>
</th>
<th>Murder and Nonnegligent
<p>Manslaughter Rate<br />
<small>(per 100,000 inhabitants)</small><br />
<small>(2015)</small></p>
</th>
</tr>
<tr>
<td><a href="/wiki/Alabama" title="Alabama">Alabama</a></td>
<td>4,853,875</td>
<td>348</td>
<td>7.2</td>
</tr>
<tr>
<td><a href="/wiki/Alaska" title="Alaska">Alaska</a></td>
<td>737,709</td>
<td>59</td>
<td>8.0</td>
</tr>
<tr>

You can actually see the data, except data values are surrounded by html code such as <td>. We can also see a pattern of how it is stored. If you know HTML, you can write programs that leverage knowledge of these patterns to extract what we want. We also take advantage of a language widely used to make webpages look “pretty” called Cascading Style Sheets (CSS). We say more about this in Section @ref(css-selectors).

Although we provide tools that make it possible to scrape data without knowing HTML, as a data scientist it is quite useful to learn some HTML and CSS. Not only does this improve your scraping skills, but it might come in handy if you are creating a webpage to showcase your work.

The rvest package

The tidyverse provides a web harvesting package called rvest. The first step in using this package is to import the webpage into R. The package makes this quite simple:

library(tidyverse)
library(rvest)
h <- read_html(url)

Note that the entire Murders in the US Wikipedia webpage is now contained in h. The class of this object is:

class(h)
## [1] "xml_document" "xml_node"

The rvest package is actually more general; it handles XML documents. XML is a general markup language (that’s what the ML stands for) that can be used to represent any kind of data. HTML is a particular type of XML developed for representing webpages. Here we focus on HTML documents.

Now, how do we extract the table from the object h? If we print h, we don’t really see much:

h
## {html_document}
## <html class="client-nojs" lang="en" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject  ...

We can see all the code that defines the downloaded webpage using the html_text function like this:

html_text(h)

We don’t show the output here because it includes thousands of characters, but if we look at it, we can see that the data we are after are stored in an HTML table: you can see this in the line of the HTML code above, <table class="wikitable sortable">. The different parts of an HTML document, often defined with text in between < and >, are referred to as nodes. The rvest package includes functions to extract nodes of an HTML document: html_nodes extracts all nodes of a given type and html_node extracts the first one. To extract the tables from the html code we use:

tab <- h %>% html_nodes("table")

Now, instead of the entire webpage, we just have the html code for the tables in the page:

tab
## {xml_nodeset (2)}
## [1] <table class="wikitable sortable"><tbody>\n<tr>\n<th>State\n</th>\n<th>\n ...
## [2] <table class="nowraplinks hlist mw-collapsible mw-collapsed navbox-inner" ...

The table we are interested in is the first one:

tab[[1]]
## {html_node}
## <table class="wikitable sortable">
## [1] <tbody>\n<tr>\n<th>State\n</th>\n<th>\n<a href="/wiki/List_of_U.S._states ...

This is clearly not a tidy dataset, not even a data frame. In the code above, you can definitely see a pattern and writing code to extract just the data is very doable. In fact, rvest includes a function just for converting HTML tables into data frames:

tab <- tab[[1]] %>% html_table
class(tab)
## [1] "data.frame"

We are now much closer to having a usable data table:

tab <- tab %>% setNames(c("state", "population", "total", "murder_rate")) 
head(tab)
##        state population total murder_rate
## 1    Alabama  4,853,875   348         7.2
## 2     Alaska    737,709    59         8.0
## 3    Arizona  6,817,565   309         4.5
## 4   Arkansas  2,977,853   181         6.1
## 5 California 38,993,940 1,861         4.8
## 6   Colorado  5,448,819   176         3.2

We still have some wrangling to do. For example, we need to remove the commas and turn characters into numbers. Before continuing with this, we will learn a more general approach to extracting information from web sites.
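
One way to carry out that cleanup, sketched here with readr’s parse_number() (which strips commas and other non-numeric characters), assuming population and total were imported as character columns:

```r
library(dplyr)
library(readr)

# Remove commas and convert the character columns to numbers:
tab <- tab %>%
  mutate(population = parse_number(population),
         total = parse_number(total))
```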

CSS selectors

The default look of a webpage made with the most basic HTML is quite unattractive. The aesthetically pleasing pages we see today are made using CSS to define the look and style of webpages. The fact that all pages for a company have the same style usually results from their use of the same CSS file to define the style. The general way these CSS files work is by defining how each of the elements of a webpage will look. The title, headings, itemized lists, tables, and links, for example, each receive their own style including font, color, size, and distance from the margin. CSS does this by leveraging patterns used to define these elements, referred to as selectors. An example of such a pattern, which we used above, is table, but there are many, many more.

If we want to grab data from a webpage and we happen to know a selector that is unique to the part of the page containing this data, we can use the html_nodes function. However, knowing which selector to use can be quite complicated. In fact, the complexity of webpages has been increasing as they become more sophisticated. For some of the more advanced ones, it seems almost impossible to find the nodes that define a particular piece of data. However, selector gadgets actually make this possible.

SelectorGadget is a piece of software that allows you to interactively determine which CSS selector you need to extract specific components from the webpage. If you plan on scraping data other than tables from HTML pages, we highly recommend you install it. A Chrome extension is available which permits you to turn on the gadget and then, as you click through the page, it highlights parts and shows you the selector you need to extract these parts. There are various demos of how to do this, including rvest author Hadley Wickham’s vignette and other tutorials based on it.
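
Once SelectorGadget (or inspection of the source) reveals a selector, it plugs straight into html_nodes(). A sketch using the h object imported above and the generic "td" selector for table cells:

```r
library(rvest)

# Extract the text of every table cell on the page:
h %>%
  html_nodes("td") %>%
  html_text() %>%
  head()
```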

JSON

Sharing data on the internet has become more and more common. Unfortunately, providers use different formats, which makes it harder for data scientists to wrangle data into R. Yet some standards are also becoming more common. A format that is currently being widely adopted is the JavaScript Object Notation, or JSON. Because this format is very general, it is nothing like a spreadsheet; a JSON file looks more like the code you would use to define a list. Here is an example of information stored in JSON format:

## [
##   {
##     "name": "Miguel",
##     "student_id": 1,
##     "exam_1": 85,
##     "exam_2": 86
##   },
##   {
##     "name": "Sofia",
##     "student_id": 2,
##     "exam_1": 94,
##     "exam_2": 93
##   },
##   {
##     "name": "Aya",
##     "student_id": 3,
##     "exam_1": 87,
##     "exam_2": 88
##   },
##   {
##     "name": "Cheng",
##     "student_id": 4,
##     "exam_1": 90,
##     "exam_2": 91
##   }
## ]
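
As a quick sketch, a JSON string like the one above can be parsed directly with jsonlite; an array of uniform objects becomes a data frame (the student data here is a small hypothetical example):

```r
library(jsonlite)

json_text <- '[{"name": "Miguel", "exam_1": 85},
               {"name": "Sofia",  "exam_1": 94}]'

students <- fromJSON(json_text)  # array of objects -> data frame
class(students)                  # "data.frame"
students$exam_1                  # a numeric vector: 85 94
```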

The file above actually represents a data frame. To read it, we can use the function fromJSON from the jsonlite package. Note that JSON files are often made available via the internet. Several organizations provide a JSON API or a web service that you can connect directly to and obtain data. Here is an example:

library(jsonlite)
citi_bike <- fromJSON("http://citibikenyc.com/stations/json")

This downloads a list. The first element tells you when you downloaded it:

citi_bike$executionTime
## [1] "2020-05-20 08:51:40 AM"

and the second element is a data table:

citi_bike$stationBeanList %>% as_tibble() 
## # A tibble: 955 x 18
##       id stationName availableDocks totalDocks latitude longitude statusValue
##    <int> <chr>                <int>      <int>    <dbl>     <dbl> <chr>      
##  1    72 W 52 St & …             17         55     40.8     -74.0 In Service 
##  2    79 Franklin S…             14         33     40.7     -74.0 In Service 
##  3    82 St James P…              8         27     40.7     -74.0 In Service 
##  4    83 Atlantic A…              8         62     40.7     -74.0 In Service 
##  5   116 W 17 St & …             37         50     40.7     -74.0 In Service 
##  6   119 Park Ave &…              8         19     40.7     -74.0 In Service 
##  7   120 Lexington …              5         19     40.7     -74.0 In Service 
##  8   127 Barrow St …             17         31     40.7     -74.0 In Service 
##  9   128 MacDougal …              6         30     40.7     -74.0 In Service 
## 10   143 Clinton St…              4         24     40.7     -74.0 In Service 
## # … with 945 more rows, and 11 more variables: statusKey <int>,
## #   availableBikes <int>, stAddress1 <chr>, stAddress2 <chr>, city <chr>,
## #   postalCode <chr>, location <chr>, altitude <chr>, testStation <lgl>,
## #   lastCommunicationTime <chr>, landMark <chr>

You can learn much more by examining tutorials and help files from the jsonlite package. This package is intended for relatively simple tasks such as converting data into tables. For more flexibility, we recommend rjson.

“Tidy” data

Let’s now switch gears and learn about the concept of “tidy” data format with a motivating example from the fivethirtyeight package. The fivethirtyeight package [@R-fivethirtyeight] provides access to the datasets used in many articles published by the data journalism website, FiveThirtyEight.com. For a complete list of all 127 datasets included in the fivethirtyeight package, check out the package webpage by going to: https://fivethirtyeight-r.netlify.com/articles/fivethirtyeight.html.

Let’s focus our attention on the drinks data frame and look at its first 5 rows:

## # A tibble: 5 x 5
##   country    beer_servings spirit_servings wine_servings total_litres_of_pure_a…
##   <chr>              <int>           <int>         <int>                   <dbl>
## 1 Afghanist…             0               0             0                     0  
## 2 Albania               89             132            54                     4.9
## 3 Algeria               25               0            14                     0.7
## 4 Andorra              245             138           312                    12.4
## 5 Angola               217              57            45                     5.9

After reading the help file by running ?drinks, you’ll see that drinks is a data frame containing results from a survey of the average number of servings of beer, spirits, and wine consumed in 193 countries. This data was originally reported on FiveThirtyEight.com in Mona Chalabi’s article: “Dear Mona Followup: Where Do People Drink The Most Beer, Wine And Spirits?”.

Let’s apply some of the data wrangling verbs we learned in Chapter @ref(wrangling) on the drinks data frame:

  1. filter() the drinks data frame to only consider 4 countries: the United States, China, Italy, and Saudi Arabia, then
  2. select() all columns except total_litres_of_pure_alcohol by using the - sign, then
  3. rename() the variables beer_servings, spirit_servings, and wine_servings to beer, spirit, and wine, respectively.

and save the resulting data frame in drinks_smaller:

drinks_smaller <- drinks %>% 
  filter(country %in% c("USA", "China", "Italy", "Saudi Arabia")) %>% 
  select(-total_litres_of_pure_alcohol) %>% 
  rename(beer = beer_servings, spirit = spirit_servings, wine = wine_servings)
drinks_smaller
## # A tibble: 4 x 4
##   country       beer spirit  wine
##   <chr>        <int>  <int> <int>
## 1 China           79    192     8
## 2 Italy           85     42   237
## 3 Saudi Arabia     0      5     0
## 4 USA            249    158    84

Let’s now ask ourselves a question: “Using the drinks_smaller data frame, how would we create the side-by-side barplot in Figure @ref(fig:drinks-smaller)?”. Recall we saw barplots displaying two categorical variables in Subsection @ref(two-categ-barplot).

Comparing alcohol consumption in 4 countries.


Let’s break down the grammar of graphics we introduced in Section @ref(grammarofgraphics):

  1. The categorical variable country with four levels (China, Italy, Saudi Arabia, USA) would have to be mapped to the x-position of the bars.
  2. The numerical variable servings would have to be mapped to the y-position of the bars (the height of the bars).
  3. The categorical variable type with three levels (beer, spirit, wine) would have to be mapped to the fill color of the bars.

Observe, however, that drinks_smaller has three separate variables: beer, spirit, and wine. In order to use the ggplot() function to recreate the barplot in Figure @ref(fig:drinks-smaller), we instead need a single variable type with three possible values: beer, spirit, and wine. We could then map this type variable to the fill aesthetic of our plot. In other words, to recreate the barplot in Figure @ref(fig:drinks-smaller), our data frame would have to look like this:

drinks_smaller_tidy
## # A tibble: 12 x 3
##    country      type   servings
##    <chr>        <chr>     <int>
##  1 China        beer         79
##  2 Italy        beer         85
##  3 Saudi Arabia beer          0
##  4 USA          beer        249
##  5 China        spirit      192
##  6 Italy        spirit       42
##  7 Saudi Arabia spirit        5
##  8 USA          spirit      158
##  9 China        wine          8
## 10 Italy        wine        237
## 11 Saudi Arabia wine          0
## 12 USA          wine         84
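
With the data in this long format, the barplot in Figure @ref(fig:drinks-smaller) can be produced with a single ggplot() call. A sketch (position = "dodge" places the bars side-by-side rather than stacked):

```r
library(ggplot2)

ggplot(drinks_smaller_tidy, aes(x = country, y = servings, fill = type)) +
  geom_col(position = "dodge")
```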

Observe that while drinks_smaller and drinks_smaller_tidy are both rectangular in shape and contain the same 12 numerical values (3 alcohol types by 4 countries), they are formatted differently. drinks_smaller is formatted in what’s known as “wide” format, whereas drinks_smaller_tidy is formatted in what’s known as “long/narrow” format.

In the context of doing data science in R, long/narrow format is also known as “tidy” format. In order to use the ggplot2 and dplyr packages for data visualization and data wrangling, your input data frames must be in “tidy” format. Thus, all non-“tidy” data must be converted to “tidy” format first. Before we convert non-“tidy” data frames like drinks_smaller to “tidy” data frames like drinks_smaller_tidy, let’s define “tidy” data.

Definition of “tidy” data

You have surely heard the word “tidy” used in everyday life.

What does it mean for your data to be “tidy”? While “tidy” has a clear English meaning of “organized,” the word “tidy” in data science using R means that your data follows a standardized format. We will follow Hadley Wickham’s definition of “tidy” data [@tidy] shown also in Figure @ref(fig:tidyfig):

A dataset is a collection of values, usually either numbers (if quantitative) or strings AKA text data (if qualitative/categorical). Values are organised in two ways. Every value belongs to a variable and an observation. A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a city) across attributes.

“Tidy” data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

(ref:tidy-r4ds) Tidy data graphic from R for Data Science.

For example, say you have the following table of stock prices in Table @ref(tab:non-tidy-stocks):

Stock prices (non-tidy format)
Date Boeing stock price Amazon stock price Google stock price
2009-01-01 $173.55 $174.90 $174.34
2009-01-02 $172.61 $171.42 $170.04

Although the data are neatly organized in a rectangular spreadsheet-type format, they do not follow the definition of data in “tidy” format. While there are three variables corresponding to three unique pieces of information (date, stock name, and stock price), there are not three columns. In “tidy” data format, each variable should be its own column, as shown in Table @ref(tab:tidy-stocks). Notice that both tables present the same information, but in different formats.

Stock prices (tidy format)
Date Stock Name Stock Price
2009-01-01 Boeing $173.55
2009-01-01 Amazon $174.90
2009-01-01 Google $174.34
2009-01-02 Boeing $172.61
2009-01-02 Amazon $171.42
2009-01-02 Google $170.04

Now we have the requisite three columns Date, Stock Name, and Stock Price. On the other hand, consider the data in Table @ref(tab:tidy-stocks-2).

Example of tidy data

Date         Boeing Price   Weather
2009-01-01   $173.55        Sunny
2009-01-02   $172.61        Overcast

In this case, even though the variable “Boeing Price” occurs just like in our non-“tidy” data in Table @ref(tab:non-tidy-stocks), the data is “tidy” since there are three variables corresponding to three unique pieces of information: Date, Boeing price, and the Weather that particular day.

Converting to “tidy” data

In this book so far, you’ve only seen data frames that were already in “tidy” format. Furthermore, for the rest of this book, you’ll mostly only see data frames that are already in “tidy” format as well. This is not the case, however, for all datasets in the world. If your original data frame is in wide (non-“tidy”) format and you would like to use the ggplot2 or dplyr packages, you will first have to convert it to “tidy” format. To do so, we recommend using the pivot_longer() function in the tidyr package [@R-tidyr].
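As a quick preview of how this works, here is a sketch (assuming the tibble and tidyr packages are installed) that recreates the stock prices from Table @ref(tab:non-tidy-stocks) as a data frame and converts it to “tidy” format. The stocks and stocks_tidy names are ours for illustration only:

```r
library(tibble)
library(tidyr)

# Recreate the non-tidy stock prices from Table @ref(tab:non-tidy-stocks)
stocks <- tibble(
  Date   = c("2009-01-01", "2009-01-02"),
  Boeing = c(173.55, 172.61),
  Amazon = c(174.90, 171.42),
  Google = c(174.34, 170.04)
)

# Gather the three stock columns into name/value pairs
stocks_tidy <- stocks %>% 
  pivot_longer(names_to = "stock_name", 
               values_to = "stock_price", 
               cols = -Date)
```

The resulting stocks_tidy has 2 \(\times\) 3 = 6 rows and three columns, one per variable: Date, stock_name, and stock_price.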

Going back to our drinks_smaller data frame from earlier:

drinks_smaller
## # A tibble: 4 x 4
##   country       beer spirit  wine
##   <chr>        <int>  <int> <int>
## 1 China           79    192     8
## 2 Italy           85     42   237
## 3 Saudi Arabia     0      5     0
## 4 USA            249    158    84

We convert it to “tidy” format by using the pivot_longer() function from the tidyr package as follows:

drinks_smaller_tidy <- drinks_smaller %>% 
  pivot_longer(names_to = "type", 
               values_to = "servings", 
               cols = -country)
drinks_smaller_tidy
## # A tibble: 12 x 3
##    country      type   servings
##    <chr>        <chr>     <int>
##  1 China        beer         79
##  2 China        spirit      192
##  3 China        wine          8
##  4 Italy        beer         85
##  5 Italy        spirit       42
##  6 Italy        wine        237
##  7 Saudi Arabia beer          0
##  8 Saudi Arabia spirit        5
##  9 Saudi Arabia wine          0
## 10 USA          beer        249
## 11 USA          spirit      158
## 12 USA          wine         84

We set the arguments to pivot_longer() as follows:

  1. names_to here corresponds to the name of the variable in the new “tidy”/long data frame that will contain the column names of the original data. Observe how we set names_to = "type". In the resulting drinks_smaller_tidy, the column type contains the three types of alcohol beer, spirit, and wine. Since type is a variable name that doesn’t appear in drinks_smaller, we use quotation marks around it. You’ll receive an error if you just use names_to = type here.
  2. values_to here is the name of the variable in the new “tidy” data frame that will contain the values of the original data. Observe how we set values_to = "servings" since each of the numeric values in the beer, spirit, and wine columns of the drinks_smaller data corresponds to a value of servings. In the resulting drinks_smaller_tidy, the column servings contains the 4 \(\times\) 3 = 12 numerical values. Note again that servings doesn’t appear as a variable in drinks_smaller so it again needs quotation marks around it for the values_to argument.
  3. The third argument cols is the columns in the drinks_smaller data frame you either want to or don’t want to “tidy.” Observe how we set this to -country indicating that we don’t want to “tidy” the country variable in drinks_smaller and rather only beer, spirit, and wine. Since country is a column that appears in drinks_smaller we don’t put quotation marks around it.

The third argument here of cols is a little nuanced, so let’s consider code that’s written slightly differently but that produces the same output:

drinks_smaller %>% 
  pivot_longer(names_to = "type", 
               values_to = "servings", 
               cols = c(beer, spirit, wine))

Note that the third argument now specifies which columns we want to “tidy” with c(beer, spirit, wine), instead of the columns we don’t want to “tidy” using -country. We use the c() function to create a vector of the columns in drinks_smaller that we’d like to “tidy.” Note that since these three columns appear one after another in the drinks_smaller data frame, we could also do the following for the cols argument:

drinks_smaller %>% 
  pivot_longer(names_to = "type", 
               values_to = "servings", 
               cols = beer:wine)
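A further option worth knowing is the tidyselect helper where(), which selects columns by a condition rather than by name. A sketch (using a made-up two-country drinks_demo data frame, and assuming every column other than country is numeric):

```r
library(tibble)
library(tidyr)

# A made-up wide data frame shaped like drinks_smaller
drinks_demo <- tibble(
  country = c("China", "Italy"),
  beer    = c(79L, 85L),
  spirit  = c(192L, 42L),
  wine    = c(8L, 237L)
)

# where(is.numeric) selects every numeric column, i.e., all but country
drinks_demo_tidy <- drinks_demo %>% 
  pivot_longer(names_to = "type", 
               values_to = "servings", 
               cols = where(is.numeric))
```

This spares you from typing out column names when there are many of them.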

With our drinks_smaller_tidy “tidy” formatted data frame, we can now produce the barplot you saw in Figure @ref(fig:drinks-smaller) using geom_col(). This is done in Figure @ref(fig:drinks-smaller-tidy-barplot). Recall from Section @ref(geombar) on barplots that we use geom_col() and not geom_bar(), since we would like to map the “pre-counted” servings variable to the y-aesthetic of the bars.

ggplot(drinks_smaller_tidy, aes(x = country, y = servings, fill = type)) +
  geom_col(position = "dodge")

(ref:drinks-col) Comparing alcohol consumption in 4 countries using geom_col().


Converting “wide” format data to “tidy” format often confuses new R users. The only way to get comfortable with the pivot_longer() function is with practice, practice, and more practice using different datasets. For example, run ?pivot_longer and look at the examples in the bottom of the help file. We’ll show another example of using pivot_longer() to convert a “wide” formatted data frame to “tidy” format in Section @ref(case-study-tidy).

If however you want to convert a “tidy” data frame to “wide” format, you will need to use the pivot_wider() function instead. Run ?pivot_wider and look at the examples in the bottom of the help file for examples.
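For example, here is a small sketch showing pivot_wider() undoing what pivot_longer() does. The two-country drinks_long data frame is made up for illustration (a subset shaped like drinks_smaller_tidy):

```r
library(tibble)
library(tidyr)

# A small "tidy"/long data frame shaped like drinks_smaller_tidy
drinks_long <- tibble(
  country  = rep(c("China", "Italy"), each = 3),
  type     = rep(c("beer", "spirit", "wine"), times = 2),
  servings = c(79, 192, 8, 85, 42, 237)
)

# pivot_wider() is the inverse of pivot_longer(): new column names come
# from the values of `type` and the cell values come from `servings`
drinks_wide <- drinks_long %>% 
  pivot_wider(names_from = type, values_from = servings)
```

The result has one row per country and one column per type of alcohol, i.e., the original wide format.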

You can also view examples of both pivot_longer() and pivot_wider() on the tidyverse.org webpage. There’s a nice example to check out the different functions available for data tidying and a case study using data from the World Health Organization on that webpage. Furthermore, each week the R4DS Online Learning Community posts a dataset in the weekly #TidyTuesday event that might serve as a nice place for you to find other data to explore and transform.

nycflights13 package

Recall the nycflights13 package we introduced in Section @ref(nycflights13) with data about all domestic flights departing from New York City in 2013. Let’s revisit the flights data frame by running View(flights). We saw that flights has a rectangular shape, with each of its 336,776 rows corresponding to a flight and each of its 19 columns corresponding to different characteristics/measurements of each flight. This satisfied the first two criteria of the definition of “tidy” data from Subsection @ref(tidy-definition): that “Each variable forms a column” and “Each observation forms a row.” But what about the third property of “tidy” data that “Each type of observational unit forms a table”?

Recall that we saw in Subsection @ref(exploredataframes) that the observational unit for the flights data frame is an individual flight. In other words, the rows of the flights data frame refer to characteristics/measurements of individual flights. Also included in the nycflights13 package are other data frames with their rows representing different observational units [@R-nycflights13]:

  • airlines: translation between two letter IATA carrier codes and airline company names (16 in total). The observational unit is an airline company.
  • planes: aircraft information about each of 3,322 planes used, i.e., the observational unit is an aircraft.
  • weather: hourly meteorological data (about 8,705 observations) for each of the three NYC airports, i.e., the observational unit is an hourly measurement of weather at one of the three airports.
  • airports: airport names and locations. The observational unit is an airport.

The organization of the information into these five data frames follows the third “tidy” data property: observations corresponding to the same observational unit should be saved in the same table, i.e., data frame. You could think of this property as the old English expression: “birds of a feather flock together.”
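This separation into tables is also why the join() operations listed at the start of the chapter exist: when a plot or summary needs information from two observational units, we merge their tables by matching on a shared “key” variable. A minimal sketch (using made-up flights_mini and airlines_mini data frames rather than the full nycflights13 data):

```r
library(tibble)
library(dplyr)

# Toy versions of two observational units kept in separate tables
flights_mini <- tibble(
  flight  = c(1545, 1714, 725),
  carrier = c("UA", "UA", "B6")
)
airlines_mini <- tibble(
  carrier = c("UA", "B6"),
  name    = c("United Air Lines Inc.", "JetBlue Airways")
)

# Merge the two tables by matching on the "key" variable carrier
flights_named <- flights_mini %>% 
  inner_join(airlines_mini, by = "carrier")
```

Each flight row now carries the full airline company name alongside its two-letter carrier code.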

Case study: Democracy in Guatemala

In this section, we’ll show you another example of how to convert a data frame that isn’t in “tidy” format (“wide” format) to a data frame that is in “tidy” format (“long/narrow” format). We’ll do this using the pivot_longer() function from the tidyr package again.

Furthermore, we’ll make use of functions from the ggplot2 and dplyr packages to produce a time-series plot showing how the democracy scores have changed over the 40 years from 1952 to 1992 for Guatemala. Recall that we saw time-series plots in Section @ref(linegraphs) on creating linegraphs using geom_line().

Let’s use the dem_score data frame we imported in Section @ref(csv), but focus on only data corresponding to Guatemala.

guat_dem <- dem_score %>% 
  filter(country == "Guatemala")
guat_dem
## # A tibble: 1 x 10
##   country   `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992`
##   <chr>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Guatemala      2     -6     -5      3      1     -3     -7      3      3

Let’s lay out the grammar of graphics we saw in Section @ref(grammarofgraphics).

First we know we need to set data = guat_dem and use a geom_line() layer, but what is the aesthetic mapping of variables? We’d like to see how the democracy score has changed over the years, so we need to map:

  • year to the x-position aesthetic and
  • democracy_score to the y-position aesthetic

Now we are stuck in a predicament, much like with our drinks_smaller example in Section @ref(tidy-data-ex). We see that we have a variable named country, but its only value is "Guatemala". We do not, however, have variables named year or democracy_score; instead, the years are spread across nine separate columns whose names are the years themselves. Unfortunately, the guat_dem data frame is not “tidy” and hence is not in the appropriate format to apply the grammar of graphics, and thus we cannot use the ggplot2 package just yet.

We need to take the names of the columns corresponding to years in guat_dem and gather them into a new “names” variable called year. Furthermore, we need to take the democracy score values in the inside of the data frame and turn them into a new “values” variable called democracy_score. Our resulting data frame will have three columns: country, year, and democracy_score. Recall that the pivot_longer() function in the tidyr package does this for us:

guat_dem_tidy <- guat_dem %>% 
  pivot_longer(names_to = "year", 
               values_to = "democracy_score", 
               cols = -country,
               names_transform = list(year = as.integer)) 
guat_dem_tidy
## # A tibble: 9 x 3
##   country    year democracy_score
##   <chr>     <int>           <dbl>
## 1 Guatemala  1952               2
## 2 Guatemala  1957              -6
## 3 Guatemala  1962              -5
## 4 Guatemala  1967               3
## 5 Guatemala  1972               1
## 6 Guatemala  1977              -3
## 7 Guatemala  1982              -7
## 8 Guatemala  1987               3
## 9 Guatemala  1992               3

We set the arguments to pivot_longer() as follows:

  1. names_to is the name of the variable in the new “tidy” data frame that will contain the column names of the original data. Observe how we set names_to = "year". In the resulting guat_dem_tidy, the column year contains the years where Guatemala’s democracy scores were measured.
  2. values_to is the name of the variable in the new “tidy” data frame that will contain the values of the original data. Observe how we set values_to = "democracy_score". In the resulting guat_dem_tidy the column democracy_score contains the 1 \(\times\) 9 = 9 democracy scores as numeric values.
  3. The third argument is the columns you either want to or don’t want to “tidy.” Observe how we set this to cols = -country indicating that we don’t want to “tidy” the country variable in guat_dem and rather only variables 1952 through 1992.
  4. The last argument, names_transform, tells R how to convert the type of the year variable. Without converting it to an integer with as.integer as we’ve done here, pivot_longer() will set it to be a character value by default.

We can now create the time-series plot in Figure @ref(fig:guat-dem-tidy) to visualize how democracy scores in Guatemala have changed from 1952 to 1992 using a geom_line(). Furthermore, we’ll use the labs() function in the ggplot2 package to add informative labels to all the aes()thetic attributes of our plot, in this case the x and y positions.

ggplot(guat_dem_tidy, aes(x = year, y = democracy_score)) +
  geom_line() +
  labs(x = "Year", y = "Democracy Score")
Democracy scores in Guatemala 1952-1992.


Note that if we forgot to include the names_transform argument converting year to integers, we would have run into trouble here since geom_line() wouldn’t have known how to sort the character values in year in the right order.
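If you find the extra pivot_longer() argument hard to remember, an equivalent approach is to pivot first and then convert year yourself with dplyr’s mutate(). A sketch using a made-up two-year scores data frame shaped like guat_dem:

```r
library(tibble)
library(tidyr)
library(dplyr)

# A made-up wide data frame with years as column names, like guat_dem
scores <- tibble(
  country = "Guatemala",
  `1952`  = 2,
  `1957`  = -6
)

# Pivot first (year comes out as character), then convert it
scores_tidy <- scores %>% 
  pivot_longer(names_to = "year", 
               values_to = "democracy_score", 
               cols = -country) %>% 
  mutate(year = as.integer(year))
```

Both approaches yield the same result; converting inside pivot_longer() simply saves a step.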

tidyverse package

Notice at the beginning of the chapter we loaded the following four packages, which are among the most frequently used R packages for data science:

library(ggplot2)
library(dplyr)
library(readr)
library(tidyr)

Recall that ggplot2 is for data visualization, dplyr is for data wrangling, readr is for importing spreadsheet data into R, and tidyr is for converting data to “tidy” format. There is a much quicker way to load these packages than by individually loading them: by installing and loading the tidyverse package. The tidyverse package acts as an “umbrella” package whereby installing/loading it will install/load multiple packages at once for you.

After installing the tidyverse package as you would a normal package as seen in Section @ref(packages), running:

library(tidyverse)

would be the same as running:

library(ggplot2)
library(dplyr)
library(readr)
library(tidyr)
library(purrr)
library(tibble)
library(stringr)
library(forcats)

The purrr, tibble, stringr, and forcats packages are left for a more advanced book; check out R for Data Science to learn about them.

For the remainder of this book, we’ll start every chapter by running library(tidyverse), instead of loading the various component packages individually. The tidyverse “umbrella” package gets its name from the fact that all the functions in all its packages are designed to have common inputs and outputs: data frames are in “tidy” format. This standardization of input and output data frames makes transitions between different functions in the different packages as seamless as possible. For more information, check out the tidyverse.org webpage for the package.

Conclusion

Additional resources

If you want to learn more about using the readr and tidyr packages, we suggest that you check out RStudio’s “Data Import Cheat Sheet.” In the current version of RStudio in late 2019, you can access this cheatsheet by going to the RStudio Menu Bar -> Help -> Cheatsheets -> “Browse Cheatsheets” -> scrolling down the page to the “Data Import Cheat Sheet.” The first page of this cheatsheet has information on using the readr package to import data, while the second page has information on using the tidyr package to “tidy” data. You can see a preview of both pages in the figures below.

Data Import cheatsheet (first page): readr package.


Data Import cheatsheet (second page): tidyr package.


